Search Based Software Engineering and Software Defect Prediction


Goran Mauša, mag. ing. el.
University of Rijeka - Faculty of Engineering, Vukovarska 58, HR Rijeka, Croatia
[email protected]

Abstract

This article presents an overview of the search based software engineering (SBSE) and software defect prediction areas. SBSE consists of search-based optimization algorithms that are being incorporated into almost every area of software engineering. However, some software engineering areas, such as software defect prediction, still exploit its benefits less often. With large software systems present in an ever increasing number of human activities, conventional testing is becoming the most expensive part of the software product lifecycle. Software defect prediction is an emerging field that aims to improve software quality and testing efficiency. Although little research has been done on the usage of SBSE in software defect prediction, some encouraging and motivating studies indicate that there is still much research to be done in this area.

Index Terms: Software engineering, optimization, defect prediction

I. INTRODUCTION

Search based software engineering (SBSE) is used in all sorts of optimization and multi-objective problems present in software engineering. The studies reviewed in this article that compare problem solving with and without SBSE indicate an improvement in results when SBSE is used. Software defect prediction is one of the software engineering fields that is only beginning to exploit its advantages. Considering the complexity of modern software products, faults are inevitable. Ideally, testing should be exhaustive, but for practical reasons that is impossible. Software defect prediction can assist the testing process by allowing software engineers to focus development activities on fault-prone code, improving software quality and making better use of resources.
This article presents an overview of the search based software engineering and software defect prediction areas. Section II describes what SBSE is and what its main characteristics are. Section III gives the scope of recent research on software engineering problems that benefit from using SBSE. Section IV gives an overview of the software defect prediction problem, while section V presents some of the research done on using SBSE within software defect prediction. Finally, section VI gives the conclusion.

II. SEARCH BASED SOFTWARE ENGINEERING

The term search based software engineering was coined by Harman and Jones. SBSE consists of search-based optimization algorithms used in software engineering, with genetic algorithms, genetic programming, simulated annealing and hill climbing being the most widely used [1]. Using SBSE has two requirements: a search space with parameters that can be manipulated to produce different candidate solutions, and a fitness function that can be measured as the output of the problem being analyzed. Each algorithm employs some degree of randomness in finding the solution to a problem. The fitness function evaluates whether the algorithm has found a better solution than the one in the previous step and guides the search towards a solution that is as close to optimal as possible [2].

Unlike other engineering disciplines in which search based algorithms have found application, software engineering is the only discipline whose artifacts are solely virtual. The lack of simulations or models that can represent and optimize these artifacts makes SBSE, together with the fitness function it uses, the closest thing to an artifact. This property makes SBSE a very attractive and potentially beneficial field. Search based algorithms are also attractive in software engineering because the data in software engineering are often inaccurate, overdispersed and incomplete, which makes some traditional optimization techniques inappropriate.
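The two requirements named above, a manipulable candidate representation and a measurable fitness function, can be illustrated with a minimal hill climbing sketch. Everything in it is an assumption made for illustration: the function names, the toy bit-vector problem and the step budget do not come from the cited studies.

```python
import random

def hill_climb(fitness, initial, neighbor, steps=1000, seed=0):
    """Single-state local search: keep one candidate, try a small random
    change, and keep the change only if fitness does not get worse."""
    rng = random.Random(seed)
    current = initial
    current_fit = fitness(current)
    for _ in range(steps):
        candidate = neighbor(current, rng)      # random small modification
        candidate_fit = fitness(candidate)      # measure the new candidate
        if candidate_fit >= current_fit:        # the fitness guides the search
            current, current_fit = candidate, candidate_fit
    return current, current_fit

# Hypothetical toy problem: maximise the number of ones in a bit vector.
def count_ones(bits):
    return sum(bits)

def flip_one_bit(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

best, best_fit = hill_climb(count_ones, [0] * 20, flip_one_bit, steps=500)
```

Swapping in a different `fitness` and `neighbor` pair is all that is needed to point the same loop at a different problem, which is the sense in which the approach is generic.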
The search based approach is very generic, with different definitions of the fitness function for various objectives. The fitness function therefore makes the same overall search based optimization strategy applicable to very different scenarios. Because it converges to an optimal or near optimal solution, SBSE is of great value when the search space contains a vast number of possible candidate solutions. A candidate solution can be a vector of numbers, a graph structure, a tree or a set of rules. The process of optimizing the candidate solution has several steps:

- Initialize the search, usually with a random choice among the possible candidate solutions
- Assess the quality of the candidate solution by computing its fitness function
- Modify the candidate solution, making it randomly slightly different
- Select the candidate solution based on fitness, according to the chosen algorithm

There are many different SBSE algorithms and each of them employs a unique approach to the four steps mentioned above. However, certain features divide the algorithms into categories:

- Local or global optimization algorithms
- Single-state or population based methods

Local algorithms tend to find a local optimum in the search space and may become trapped there, while global search techniques overcome that problem. Although this is an obvious reason for using a global search technique, there is still a trade-off between efficiency and effectiveness. Global search techniques are more effective, but at the cost of greater computational effort, which makes local search techniques more adequate for simpler problems. Single-state methods hold one candidate solution at a time, while population based methods hold a sample of candidate solutions (a population). The population based methods borrowed the concept from biology and got the name Evolutionary Algorithms. That is why their modify step involves mutation and recombination of the fittest parents, with a tendency to create even fitter children. Examples of local single-state algorithms are hill climbing, the greedy algorithm and simulated annealing, while evolutionary algorithms like the non-dominated sorting genetic algorithm (NSGA) and the two-archive algorithm fall into the group of global population based algorithms.

SBSE can be very helpful when dealing with multi-objective problems. In such cases there is often a necessary trade-off between different objectives and there is no single optimal solution to the problem. SBSE techniques tend to find a sample of non-dominated solutions, which form the Pareto optimality front. Because the objectives are opposed, non-dominated solutions cannot be ranked against each other, yet no other known solution is better than them in every objective. That is why we cannot find one optimal solution, but we still gain an insight into the problem and into the possible, near optimal solutions.
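As a small illustration of what non-dominated means, the following sketch filters a list of candidate solutions down to its Pareto front. It assumes that both objectives are minimised; the objective pairs are invented for the example.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly
    better in at least one (all objectives are minimised)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (cost, unsatisfied customers) pairs for six candidate solutions.
candidates = [(10, 5), (8, 7), (12, 3), (9, 9), (11, 5), (8, 6)]
front = pareto_front(candidates)
# front == [(10, 5), (12, 3), (8, 6)]
```

None of the three remaining candidates is better than another in both objectives, so the choice among them is left to the decision maker.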
Multi-objective problems of this kind are often present in software engineering, and SBSE provides very good support for the experts' decision making process.

III. SEARCH BASED SOFTWARE ENGINEERING USAGE

SBSE is a widespread group of optimization techniques that has found usage in almost every stage of the software product lifecycle. However, some software engineering areas, like software testing, still exploit its benefits much more often than others [3]. Some applications of SBSE in the most recent studies are:

- Solving the next release problem [3]-[5]
- Automatic software repair problem [6]-[11]
- Test data optimization problem [12]-[16]
- Generating higher order mutants [17], [18]
- Other multi-objective problems in software engineering [19], [20]

A. Next Release Problem

The next release problem is the multi-objective problem of deciding which requirements should appear in the next release of a software product. The task of SBSE is to choose a small subset among all possible requirements in order to satisfy as many stakeholders or customers as possible and to minimize the cost at the same time. Solving the next release problem is difficult, and there are often so many combinations of solutions that it becomes practically impossible to solve manually. Using SBSE does not give one specific answer to that problem but instead gives a range of equally good, near optimal solutions. That provides valuable and necessary support for the decision maker.

Zhang et al. [4] used search-based optimization techniques to automate the search for optimal or near optimal allocations of requirements that balance competing stakeholder objectives in the next release problem. NSGA-II and the two-archive algorithm were compared in performance and the two-archive algorithm proved to be the better one. Durillo et al. [3] analyzed which features to include in the next release of a product in order to satisfy as many customers as possible with minimal cost, using 3 multi-objective metaheuristics.
NSGA-II was used as a reference algorithm in the field of multi-objective optimization, MOCell because it outperformed NSGA-II in several studies, and PAES as one of the simplest techniques. The results showed that NSGA-II finds the highest number of optimal solutions, MOCell finds the widest range of different solutions and PAES was the fastest but the worst, as expected. Finkelstein et al. [5] proposed that each notion of fairness in the next release problem should form an objective in a multi-objective, Pareto optimal SBSE setting. Comparing NSGA-II and the two-archive algorithm, the results showed that they performed equally well with random data sets, while NSGA-II performed better with a real data set obtained from Motorola.

B. Automatic Software Repair

Fixing bugs is a difficult and time-consuming manual process, and some reports say software maintenance usually consumes about 90% of all costs after delivery of a product. An efficient, fully automated technique for repairing program defects could, therefore, alleviate that heavy burden. The idea of using SBSE in automatic software repair is rather simple but efficient. The first task is to locate the region of the program relevant to an error. An SBSE algorithm is then used to produce simple changes along the path where the fault lies, trying to eliminate the fault and maintain the functionality of the original program at the same time.

Weimer et al. [6] used genetic programming for the software repair phase in a fully automated method for locating and repairing bugs in software. Fast et al. [7] used genetic programming in automated program repair as well, but their focus was on improving the fitness function, which resulted in an efficiency improvement of 81%. Forrest et al. [8] combined genetic programming with program analysis methods to repair bugs in off-the-shelf legacy C programs and managed to repair all of the 11 programs they used. Schulte et al. [9] explored the advantages of assembly-level repair using evolutionary computing in the automated program repair problem. While keeping all the advantages of the assembly level approach over the source code level approach, they obtained results nearly as efficient as at source code level. Nguyen et al. [10] successfully used a genetic programming approach to automate the repair of program bugs in existing software, with average running time from half a second to ten minutes. Weimer et al. [11] also successfully used genetic programming for automatic bug repair in off-the-shelf legacy C programs, combining program analysis methods with evolutionary computing. Their approach requires 1428 seconds and 3903 fitness evaluations on average to construct a repair. All of these promising results indicate that SBSE is a proper choice for automated software repair.

C. Test Data Optimization

Test data generation, regeneration and minimization are three different approaches to the test data optimization problem. They aim to identify near optimal test sets in reasonable time using evolutionary testing, a sub-field of SBSE techniques often called Search Based Testing. Test data generation begins from scratch, regeneration uses pre-existing test data as a starting point, and test suite minimization aims to identify and remove redundant test cases.

The problem of how to test something that reacts differently to the same test input over time arises when dealing with autonomous agents. Nguyen et al. [12] proposed a solution to that problem and used evolutionary optimization to generate demanding test cases for such autonomous agents that produce different output for the same input, due to increasing knowledge for example.
Their study used the NSGA-II algorithm, a fast multi-objective genetic algorithm that other studies found to be better at finding widely spread solutions and to converge better to the optimum than other algorithms.

Harman and McMinn [13] analyzed which type of search is best for the structural test data generation problem. They used evolutionary testing as a global search algorithm, hill climbing as a local search algorithm and a hybrid memetic algorithm. Their suspicion proved to be correct: the results showed that the hybrid approach is capable of the best overall performance in terms of coverage. McMinn et al. [14] also compared evolutionary testing as a global search algorithm, hill climbing as a local search and a hybrid memetic algorithm in the search based structural test data generation problem, but with the improvement of irrelevant input variable removal (INVR). As expected, the results showed that all search algorithms are more successful and cover more branches with INVR, and that the memetic algorithm is the most prolific technique at successfully finding significance, both with and without INVR.

There are many testing scenarios in which a tester may already have some pre-existing test cases. They could have been created by the tester, based on experience, expertise and domain knowledge, or by the developer, or they may be present from regression testing of a previous version of the product. In order to exploit the effort and knowledge that went into forming these test cases, Yoo and Harman [15] proposed using pre-existing test data as a starting point in search-based test data generation, making it a regeneration process. They used hill climbing without random initialization, but with random-first-ascent and pre-existing test data.
The results were promising, indicating that the proposed approach can be up to two orders of magnitude more efficient, and can achieve higher structural coverage and an equal component level of mutation score at much lower cost, compared to a state-of-the-art search based testing technique.

Test suite minimization can prove to be very important under the strict time limits that are often imposed on regression testing. Regression testing has to guarantee that recent changes in a program do not interfere with its functionality. As the test suite keeps growing, executing the entire test suite becomes prohibitively expensive. Yoo and Harman [16] analyzed a hybrid algorithm that combines the efficient approximation of the greedy algorithm with the capability of a population based genetic algorithm for Pareto efficient multi-objective test suite minimization. They found that the greedy algorithm may provide a good approximation of the Pareto front in smaller software products, but for larger products the usage of HNSGA-II is suggested because it gives a more precise test suite minimization.

D. Higher Order Mutation Generation

It is said that 90% of the faults that survive the testing procedure are complex ones. In order to reduce their number, we need to learn more about them. Higher order mutants (HOMs) are deliberately faulty programs used in the software testing process. The order of a mutant reflects the number of faults injected into the original program. Finding higher order mutants that create subtle and complex faults, which sometimes practically mask one another, helps us locate hazardous combinations of faults that are harmless when they exist separately. Such HOMs are more difficult to find with simple test cases, and therein lies a possible usage of HOMs: to find better test cases.

Jia and Harman [17] compared the performance of 3 algorithms for finding optimal HOMs: the greedy algorithm, the genetic algorithm and the hill climbing algorithm.
The results showed that the genetic algorithm performed best because it finds the most subsuming HOMs, hill climbing always finds the highest fitness HOMs and the greedy algorithm finds the highest order HOMs. What makes their findings questionable is the fact that random search found more HOMs than the greedy algorithm and hill climbing. Langdon et al. [18] explored the usage of a multi-objective Pareto optimal approach with Monte Carlo sampling, the NSGA-II genetic algorithm and genetic programming to search for higher order mutants which are both harder to kill and more realistic. They found that their higher order genetic programming mutation testing approach is able to find even simple faults that mask each other in combination and therefore form complex faults that are very difficult to detect.

E. Other Multi-Objective Problems in Software Engineering

Besides the next release problem, there are many more problems in software engineering that involve multiple incomparable and often opposing objectives. A well known, highly important and challenging problem in software engineering is the one that requires a high degree of cohesion and a low degree of coupling for a good module structure. This problem intensifies as software evolves and its modular structure tends to degrade. Praditwong et al. [19] compared automated techniques for suggesting software clustering, delimiting boundaries between modules that maximize cohesion and minimize coupling. They used hill climbing as a single-objective algorithm and the two-archive algorithm as a multi-objective algorithm with 2 different approaches: the maximizing cluster approach (MCA) and the equal-size cluster approach (ECA). The results indicated that ECA is superior to MCA and hill climbing in this task.

Many other multi-objective, but completely different, problems can be found in software engineering management. Project managers often do not understand complex optimization techniques, and they do not need to in order to benefit from them. It is important to provide them with a tool that easily accepts their input and produces visually accessible output that can help them make important decisions. One such problem was analyzed by Di Penta et al. [20], who used 3 metaheuristics in staff and task allocation with the objective of minimizing the completion time and reducing schedule fragmentation.
They compared the performance of NSGA-II, stochastic hill climbing and simulated annealing as the most widely used SBSE techniques, which appear in 80% of publications. The results were compared in a single objective optimization approach, where simulated annealing was the best algorithm.

IV. DEFECT PREDICTION

Software defect prediction is an emerging field in software engineering that aims to improve software quality and testing efficiency. Arguments in favor of this field come mostly from the great amount of time and human resources spent on locating and fixing faulty software modules. Defect prediction does not offer an optimal allocation of resources. Instead, it tells software managers which parts of the software code are more likely to be fault prone and, therefore, where it would be wise to concentrate their testing resources.

Fig. 1. Defect Prediction Process

Figure 1 presents the stages of the defect prediction process. Each of these stages can be done automatically, so the whole process does not consume many resources in the software product life cycle. After collecting and preparing data for the defect prediction task, presented in the upper row of figure 1, there are two key steps. The first key step is building the prediction model with the learning set. The task of the learning set is to present the input and the desired output to the model so that the model can adjust accordingly. The second key step is to evaluate the model's predicting capabilities with the testing set. The testing set allows us to see how accurately the model performs on unseen data by comparing the output of the model with the expected output.

A. Data Collection

Software defect prediction refers to identifying error prone software modules by data mining the static code attributes. Besides static code attributes, which include certain code and design measures, some researchers use additional attributes that include process and personnel details.
Because extracting static code attributes is a simple and cheap procedure, even for large systems, other attributes are rarely used [21]. There are some publicly available data sets intended for defect prediction research, and the NASA PROMISE data sets are particularly popular.

B. Data Preprocessing

Data preprocessing is an important step in data mining. It includes procedures like feature selection among independent variables and outlier removal, which are not always included in studies. The feature selection process is usually performed using a stepwise selection procedure [22], correlation analysis, grey relational analysis [23] or principal component analysis [24]. Both feature selection and outlier removal are used in order to improve the performance of defect prediction models.

C. Data Splitting

Every defect prediction model has to be taught how to perform classification. It also has to be tested to see how well it performs with unseen data. Proper evaluation of performance is possible only when we know the correct output for the unseen data, and that is why data sets are split into learning and testing data sets.

For larger data sets the data splitting process can be traditional or crossvalidation. The traditional process usually splits the data randomly into a learning and a testing set in the ratio 67%:33%. The improved traditional process repeats the random split, usually 100 times. The crossvalidation process, on the other hand, splits the data randomly into several groups of equal size and chooses one as the testing set. It then rotates through the groups, each time assigning a different group to the testing set. If it splits the data into 3 groups, then it has the same ratio as the traditional process and is called threefold crossvalidation. Extensive testing showed that splitting the data into 10 groups (tenfold crossvalidation) leads to an even better estimate of the prediction model [25].

There are also data splitting processes for smaller data sets, like leave-one-out or Bootstrap. The leave-one-out process performs the testing on only one sample, repeating the process for all samples. The Bootstrap process randomly chooses samples for the learning set, but with the possibility of choosing the same sample repeatedly. The unchosen samples are then assigned to the testing set. The Bootstrap process can also be repeated iteratively, and it is usually preferred over leave-one-out.

D. Prediction Model

A great number of data mining and classification models and algorithms are used in software defect prediction. As mentioned in the previous subsections, each model uses the static code attributes as input and gives the presence of a defect or the number of defects as output.
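Returning briefly to the crossvalidation process of subsection C, the rotation of groups can be sketched as follows. This is only an illustration under assumed names (`k_fold_splits`, `n_samples`, `seed`); when the number of samples is not divisible by k, the group sizes differ by at most one.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (learning, testing) index lists for k-fold crossvalidation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random assignment to groups
    folds = [indices[i::k] for i in range(k)]     # k groups of near-equal size
    for i in range(k):
        testing = folds[i]                        # one group is held out
        learning = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]              # the rest form the learning set
        yield learning, testing

# Threefold crossvalidation on 30 samples: each round learns on two thirds
# of the samples and tests on the remaining third.
splits = list(k_fold_splits(n_samples=30, k=3))
```

Every sample is used for testing exactly once, which is what distinguishes crossvalidation from repeating an independent random split.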
The most popular and state-of-the-art algorithms may be divided into statistical classifiers, nearest-neighbor methods, neural networks, support vector machine-based classifiers, decision tree-based approaches and ensemble methods [26]. A more thorough list of algorithms is:

1) Statistical classifiers: Linear Discriminant Analysis, Quadratic Discriminant Analysis, Logistic Regression, Naive Bayes, Bayesian Networks, Least-Angle Regression, Relevance Vector Machine
2) Nearest neighbor methods: k-nearest Neighbor, K-Star
3) Neural networks: Multi-Layer Perceptron, Radial Basis Function Network
4) Support vector machine-based classifiers: Support Vector Machine, Lagrangian SVM, Least Squares SVM, Linear Programming, Voted Perceptron
5) Decision tree approaches: C4.5 Decision Tree, Classification and Regression Tree, Alternating Decision Tree
6) Ensemble methods: Random Forest, Logistic Model Tree

E. Evaluation

The final step in software defect prediction is the evaluation of the prediction model's capabilities. Depending on the output of the prediction model, a different set of evaluation metrics is available. Most defect prediction models classify software modules into fault-prone and non-fault-prone. With such binary classification, researchers most often begin the evaluation by counting the number of correctly and incorrectly predicted modules and placing them into a confusion matrix. The confusion matrix provides four scores. A true positive (TP) score and a true negative (TN) score are counted for every correctly classified fault-prone module and non-fault-prone module, respectively. A false positive (FP) score and a false negative (FN) score are counted for every misclassified non-fault-prone module and fault-prone module, respectively. Using these four scores it is possible to calculate several evaluation measures.
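A minimal sketch of how the four confusion matrix scores might be counted is given below. It assumes that each module is labelled 1 for fault-prone and 0 for non-fault-prone; the function name and the sample labels are illustrative only.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN and FN for binary labels (1 = fault-prone)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1      # fault-prone module correctly flagged
        elif p and not a:
            fp += 1      # non-fault-prone module flagged by mistake
        elif a:
            fn += 1      # fault-prone module that was missed
        else:
            tn += 1      # non-fault-prone module correctly passed
    return tp, fp, tn, fn

# Hypothetical labels for eight modules.
actual    = [1, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 1, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(actual, predicted)
# tp == 2, fp == 1, tn == 4, fn == 1
```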
The most often used evaluation measures are:

Accuracy (number of correctly classified modules divided by total number of modules):

$$ACC = \frac{TP + TN}{TP + FP + TN + FN} \tag{1}$$

The true positive rate (TPR), often referred to as recall or sensitivity (number of correctly classified fault-prone modules divided by total number of fault-prone modules):

$$TPR = \frac{TP}{FN + TP} \tag{2}$$

The false positive rate (FPR), also known as false alarm rate or fallout (number of modules misclassified as fault-prone divided by total number of non-fault-prone modules):

$$FPR = \frac{FP}{TN + FP} \tag{3}$$

A more general measure of evaluation is the area under the ROC curve (AUC). It is often used when comparing the performance of different prediction models. The ROC (receiver operating characteristic) curve, originally from the signal detection problem, graphically presents the trade-off between TPR and FPR. AUC is the integral of the ROC curve and its value ranges from 0 to 1.

Some defect prediction models predict the number of defects. Such numeric output requires different evaluation measures. If we define $p_1, p_2, \ldots, p_n$ as predicted values, $a_1, a_2, \ldots, a_n$ as actual values and $\bar{a}$ as the mean of the actual values, we can compute the following measures of evaluation:

Mean-squared error:

$$MSE = \frac{\sum_{i=1}^{n} (p_i - a_i)^2}{n} \tag{4}$$

Root mean-squared error:

$$RMSE = \sqrt{MSE} \tag{5}$$

Mean-absolute error:

$$MAE = \frac{\sum_{i=1}^{n} |p_i - a_i|}{n} \tag{6}$$

Relative-squared error:

$$RSE = \frac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2} \tag{7}$$

Relative-absolute error:

$$RAE = \frac{\sum_{i=1}^{n} |p_i - a_i|}{\sum_{i=1}^{n} |a_i - \bar{a}|} \tag{8}$$

The choice of appropriate measures depends entirely on the given situation and the aim of the research.

V. SEARCH BASED SOFTWARE ENGINEERING IN SOFTWARE DEFECT PREDICTION

As demonstrated in the previous sections, SBSE has proved to be a very effective tool for dealing with various problems in software engineering. For the purpose of this study, a review of literature was performed in order to find as much as possible of the research that has been done on SBSE usage in software defect prediction. Two systematic reviews were used as motivating examples of how to conduct a thorough exploration of literature. A systematic review of literature performed by Beecham et al. [27] showed the way to explore the software defect prediction area. The papers that passed all their rigorous quality measures were included in our scope of research as well. Another systematic review, performed by Ali et al. [28], showed the way to explore the SBSE area. This study combined the search terms from [27] and [28] and expanded them. Both systematic reviews reported that the most relevant papers are to be found in the IEEE Xplore and ACM Digital Library databases.
Thus, the following search term was looked for in abstracts in IEEE Xplore and ACM Digital Library: ("fault prediction" OR "fault forecast" OR "defect prediction" OR "defect forecast" OR "bug prediction" OR "bug forecast" OR "failure prediction" OR "failure forecast" OR "error prediction" OR "error forecast") AND ("metaheuristic" OR "metaheuristic" OR "search based" OR "search-based" OR "genetic algorithm" OR "evolutionary" OR "hill climbing" OR "simulated annealing" OR "ant colony"). A surprisingly low number of only 17 papers were found with this search, suggesting that this area is yet to be explored. Although only a small number of papers were found, they contain some encouraging research topics.

Podgorelec [29] wanted to improve the performance of defect prediction models by constructing an outlier filtering method. His goal was to achieve more reliable results by training the classifier with filtered data. The outlier filtering method was based on evolutionary induced decision trees. The idea for locating outliers was to look for data cases on which accurate classifiers produce opposite decisions. The results indicated that all the decision tree based classifiers (AREX, ID3, C4.5) improved accuracy on the training and testing sets, the Naive-Bayes classifier improved only on the training set, and the IB and logistic regression classifiers showed some improvement on the testing set.

In any predicting task, the use of irrelevant variables can degrade the model's performance. Pendharkar [30] proposed two hybrid models for selecting the variables that give the best prediction accuracy: exhaustive search with probabilistic neural networks (ES-PNN) and simulated annealing with probabilistic neural networks (SA-PNN). Using two real-world software engineering data sets, the results showed that the hybrid algorithms outperform standard machine learning methods. Considering the factorial complexity, exhaustive search is not practical for a large number of features.
One examined data set has 8 code attributes. For that data set, the difference between the exhaustive search and simulated annealing approaches in selecting the optimal set of predicting features was that exhaustive search found all eleven optimal combinations of features, while simulated annealing found only one. The other examined data set has 22 code attributes. Performing exhaustive search on that data set, in the given computational environment, would require a period of about 120 years. This study presents a very pragmatic example of the value of SBSE.

Interesting research was done by Hochman et al. [31]. In the prediction model building phase they used evolutionary enhanced artificial neural networks (ENN). The evolutionary algorithm was used to obtain the optimal configuration of an artificial neural network for the task of defect prediction. They compared the ENN algorithm with discriminant analysis and found ENN to be robust and of superior performance, suggesting that the algorithm should be taken into account in other software engineering areas as well. A similar idea was found in the study performed by Benaddy et al. [32], who used a genetic algorithm to enhance neural network and regression model learning. Their results also indicated that implementing a genetic algorithm leads to better results than classical learning methods.
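To give a feel for the simulated annealing feature selection discussed above for [30], the sketch below searches over feature subsets of an 8-attribute data set. It is not the SA-PNN model itself: the probabilistic neural network is replaced by a hypothetical `toy_score` function, and the cooling schedule, step budget and all names are assumptions made for illustration.

```python
import math
import random

def anneal_feature_subset(score, n_features, steps=2000,
                          t_start=1.0, t_end=0.01, seed=0):
    """Search for a feature mask with a high score using simulated annealing."""
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]
    current_score = score(current)
    best, best_score = current[:], current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = current[:]
        i = rng.randrange(n_features)
        candidate[i] = not candidate[i]           # add or drop one feature
        delta = score(candidate) - current_score
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature falls.
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, current_score + delta
            if current_score > best_score:
                best, best_score = current[:], current_score
    return best, score(best)

# Hypothetical stand-in for prediction accuracy: features 0, 2 and 5 help,
# every other selected feature costs a small penalty.
RELEVANT = {0, 2, 5}

def toy_score(mask):
    hits = sum(1 for i, on in enumerate(mask) if on and i in RELEVANT)
    noise = sum(1 for i, on in enumerate(mask) if on and i not in RELEVANT)
    return hits - 0.1 * noise

best_mask, best_score = anneal_feature_subset(toy_score, n_features=8)
```

In a real setting `score` would train and evaluate a classifier on the selected attributes, so each evaluation is expensive; that cost is why a guided search becomes attractive once the number of attributes grows.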

VI. CONCLUSION

Software defect prediction is one of the software engineering areas that rarely benefits from search based software engineering. The potential of using SBSE is vast and still needs to be explored more thoroughly. Hybrid algorithms are emerging within SBSE that provide promising results in other software engineering areas. SBSE algorithms have found application even in the data preprocessing stage of software defect prediction. There are also many algorithms within the prediction model building stage that could be used in combination with SBSE algorithms. To sum up, both SBSE and software defect prediction are relatively novel software engineering areas, and the usage of SBSE in software defect prediction is even less explored. That is why they offer great opportunities for further research.

REFERENCES

[1] Mark Harman and Afshin Mansouri. Search based software engineering: Introduction to the special issue of the IEEE Transactions on Software Engineering. 36(6): , [2] Sean Luke. Essentials of Metaheuristics. Lulu, Available for free at sean/book/metaheuristics/. [3] Juan J. Durillo, Yuanyuan Zhang, Enrique Alba, Mark Harman, and Antonio J. Nebro. A study of the bi-objective next release problem. Empirical Software Engineering, 16(1):29 60, February [4] Yuanyuan Zhang, Mark Harman, Anthony Finkelstein, and S. Afshin Mansouri. Comparing the performance of metaheuristics for the analysis of multi-stakeholder tradeoffs in requirements optimisation. Information and Software Technology, 53(7): , July [5] Anthony Finkelstein, Mark Harman, S. Afshin Mansouri, Jian Ren, and Yuanyuan Zhang. A search based approach to fairness analysis in requirement assignments to aid negotiation, mediation and decision making. Requir. Eng., 14: , October [6] Westley Weimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. Automatically finding patches using genetic programming.
In Proceedings of the 31st International Conference on Software Engineering, ICSE 09, pages , Washington, DC, USA, IEEE Computer Society. [7] Ethan Fast, Claire Le Goues, Stephanie Forrest, and Westley Weimer. Designing better fitness functions for automated program repair. In Proceedings of the 12th annual conference on Genetic and evolutionary computation, GECCO 10, pages , New York, NY, USA, ACM. [8] Stephanie Forrest, ThanhVu Nguyen, Westley Weimer, and Claire Le Goues. A genetic programming approach to automated software repair. In Proceedings of the 11th Annual conference on Genetic and evolutionary computation, GECCO 09, pages , New York, NY, USA, ACM. [9] Eric Schulte, Stephanie Forrest, and Westley Weimer. Automated program repair through the evolution of assembly code. In Proceedings of the IEEE/ACM international conference on Automated software engineering, ASE 10, pages , New York, NY, USA, ACM. [10] ThanhVu Nguyen, Westley Weimer, Claire Le Goues, and Stephanie Forrest. Using execution paths to evolve software patches. In Proceedings of the IEEE International Conference on Software Testing, Verification, and Validation Workshops, ICSTW 09, pages , Washington, DC, USA, IEEE Computer Society. [11] Westley Weimer, Stephanie Forrest, Claire Le Goues, and ThanhVu Nguyen. Automatic program repair with evolutionary computation. Commun. ACM, 53: , May [12] Cu D. Nguyen, Anna Perini, Paolo Tonella, Simon Miles, Mark Harman, and Michael Luck. Evolutionary testing of autonomous software agents. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, AAMAS 09, pages , Richland, SC, International Foundation for Autonomous Agents and Multiagent Systems. [13] Mark Harman and Phil McMinn. A theoretical and empirical study of search-based testing: Local, global, and hybrid search. IEEE Trans. Softw. Eng., 36: , March [14] Phil McMinn, Mark Harman, Kiran Lakhotia, Youssef Hassoun, and Joachim Wegener. 
Input domain reduction through irrelevant variable removal and its effect on local, global and hybrid searchbased structural test data generation. IEEE Transactions on Software Engineering, [15] Shin Yoo and Mark Harman. Test data regeneration: Generating new test data from existing test data. Journal of Software Testing, Verification and Reliability, To appear. [16] Shin Yoo and Mark Harman. Using hybrid algorithm for pareto efficient multi-objective test suite minimisation. J. Syst. Softw., 83: , April [17] Yue Jia and Mark Harman. Higher order mutation testing. Inf. Softw. Technol., 51: , October [18] William B. Langdon, Mark Harman, and Yue Jia. Efficient multiobjective higher order mutation testing with genetic programming. J. Syst. Softw., 83: , December [19] Kata Praditwong, Mark Harman, and Xin Yao. Software module clustering as a multi-objective search problem. IEEE Trans. Softw. Eng., 37: , March [20] Massimiliano Di Penta, Mark Harman, and Giuliano Antoniol. The use of search-based optimization techniques to schedule and staff software projects: an approach and an empirical study. Softw. Pract. Exper., 41: , April [21] Tim Menzies, Zach Milton, Burak Turhan, Bojan Cukic, Yue Jiang, and Ayşe Bener. Defect prediction from static code features: current results, limitations, new approaches. Automated Software Engg., 17: , December [22] V. Porter L. Brian, J. Daly and J. Wüst. Predicting fault-prone classes with design measures in object-oriented systems. In Proceedings of the The Ninth International Symposium on Software Reliability Engineering, pages 334, Washington, DC, USA, IEEE Computer Society. [23] Yunfeng Luo, Kerong Ben, and Lei Mi. Software metrics reduction for fault-proneness prediction of software modules. In Proceedings of the 2010 IFIP international conference on Network and parallel computing, NPC 10, pages , Berlin, Heidelberg, Springer- Verlag. [24] Lionel C. Briand, John W. Daly, Victor Porter, and Jürgen Wüst. 
A comprehensive empirical validation of product measures for object-oriented systems, [25] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Amsterdam, 3. edition, [26] Stefan Lessmann, Bart Baesens, Christophe Mues, and Swantje Pietsch. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Trans. Softw. Eng., 34: , July [27] S. Beecham, T. Hall, D. Bowes, D. Gray, S. Counsell, and S. Black. A systematic review of fault prediction approaches used in software engineering. Technical Report Lero-TR , Lero, [28] Shaukat Ali, Lionel C. Briand, Hadi Hemmati, and Rajwinder Kaur Panesar-Walawege. A systematic review of the application and empirical investigation of search-based test case generation. IEEE Trans. Softw. Eng., 36: , November [29] Vili Podgorelec. Improved mining of software complexity data on evolutionary filtered training sets. WSEAS Trans. Info. Sci. and App., 6: , November [30] Parag C. Pendharkar. Exhaustive and heuristic search approaches for learning a software defect prediction model. Eng. Appl. Artif. Intell., 23:34 40, February [31] R. Hochman, J. P. Hudepohl, E. B. Allen, and T. M. Khoshgoftaar. Evolutionary neural networks: A robust approach to software reliability problems. In Proceedings of the Eighth International Symposium on Software Reliability Engineering, ISSRE 97, pages 13, Washington, DC, USA, IEEE Computer Society. [32] Mohamed Benaddy, Sultan Aljahdali, and Mohamed Wakrim. Evolutionary prediction for cumulative failure modeling: A comparative study. In Proceedings of the 2011 Eighth International Conference on Information Technology: New Generations, ITNG 11, pages 41 47, Washington, DC, USA, IEEE Computer Society.


More information

Performance Evaluation Metrics for Software Fault Prediction Studies

Performance Evaluation Metrics for Software Fault Prediction Studies Acta Polytechnica Hungarica Vol. 9, No. 4, 2012 Performance Evaluation Metrics for Software Fault Prediction Studies Cagatay Catal Istanbul Kultur University, Department of Computer Engineering, Atakoy

More information

Data Mining & Data Stream Mining Open Source Tools

Data Mining & Data Stream Mining Open Source Tools Data Mining & Data Stream Mining Open Source Tools Darshana Parikh, Priyanka Tirkha Student M.Tech, Dept. of CSE, Sri Balaji College Of Engg. & Tech, Jaipur, Rajasthan, India Assistant Professor, Dept.

More information

MS1b Statistical Data Mining

MS1b Statistical Data Mining MS1b Statistical Data Mining Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Administrivia and Introduction Course Structure Syllabus Introduction to

More information

Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data

Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data Fifth International Workshop on Computational Intelligence & Applications IEEE SMC Hiroshima Chapter, Hiroshima University, Japan, November 10, 11 & 12, 2009 Extension of Decision Tree Algorithm for Stream

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information
