Data Mining & Data Stream Mining Open Source Tools




Darshana Parikh (1), Priyanka Tirkha (2)
(1) Student M.Tech, Dept. of CSE, Sri Balaji College of Engg. & Tech, Jaipur, Rajasthan, India
(2) Assistant Professor, Dept. of CSE, Sri Balaji College of Engg. & Tech, Jaipur, Rajasthan, India

Abstract: The term data mining came into use in the mid-1990s. Earlier, data analysis was carried out through shell scripts, command-line arguments, filters and pipelines, which is awkward by today's standards. Nowadays a number of data mining tools have been developed by the research community. They provide graphical user interfaces (GUIs) so that users can interact with them easily, and they are available free of cost under open source licenses. Because they are open source, users can extend them with new methods, and they offer flexibility in handling various types of data. Here we describe open source tools for data mining and data stream mining: R, WEKA, Orange, KNIME and MOA. Our proposed structure uses the open source tool MOA. MOA contains a collection of online and offline methods for both classification and clustering, as well as tools for evaluation. For data stream mining, we consider MOA the most suitable tool available today.

Keywords: MOA, WEKA, Data Stream, Big data, Novel class.

I. INTRODUCTION

A data stream is a continuous flow of data. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches and sensor data. Data stream mining is the process of extracting knowledge structures from continuous, rapid data records [2]. It can be considered a subfield of data mining. Data streams can be classified into online streams and offline streams. Online data stream mining is used in a number of real-world applications, including network traffic monitoring, intrusion detection and credit card fraud detection. Offline data stream mining is used for tasks such as generating reports based on web log streams. The defining characteristic of a data stream is a continuous flow of data: the data size is extremely large and potentially infinite, so it is not possible to store all of the data.
The major problems in data stream mining are infinite length, concept evolution and concept drift. Infinite length means that a data stream is potentially unbounded, so storing it and training on all of it would require unbounded storage and training time [3]. Concept evolution means the emergence of novel classes, and concept drift means that the data changes over time. The main topic of our thesis is concept evolution, that is, the emergence of novel classes. Novel classes cannot arise if we assume that the total number of classes is fixed. However, in data stream classification problems such as intrusion detection, text classification and fault detection, new classes may evolve at any time, so this assumption is not valid in a real streaming environment. Most existing data stream classification techniques ignore this important aspect of stream data, the arrival of a novel class. Approaches designed for concept evolution also address infinite length and concept drift [1].

Nowadays large quantities of data are generated, and it is sometimes not possible to store all of it. Real-time analytics on data streams are needed to manage the data currently being generated at an increasing rate by applications such as sensor networks, measurements in network monitoring and traffic management, call detail records, blogging and Twitter posts. In the data stream model, data arrive at high speed and algorithms must process them under constraints of space and time. In data stream mining we are interested in three dimensions.

Copyright to IJIRSET www.ijirset.com 5234

1. Accuracy
2. Amount of space (computer memory)
3. Time

These dimensions are typically interdependent: adjusting the time and space used by an algorithm can influence its accuracy. Measuring all three dimensions simultaneously is an open issue in data stream mining [4].

II. BIG DATA

Big data is a new term used to identify data sets that, due to their large size, cannot be managed with typical data mining software tools. Big data mining is the capability of extracting useful information from these large data sets or streams of data. Big data analytics is proving to be an important tool for improving efficiency and quality in organizations.

There are two approaches to dealing with big data: sampling and distributed systems. Sampling is used when the data set is too large to process in full. A good sampling method tries to select the best instances so as to achieve good performance using a small quantity of time and memory. The distributed systems used nowadays are based on the MapReduce framework. The MapReduce model divides an algorithm into two steps: map and reduce. The input data is split into different data sets, and each split is sent to a mapper that transforms the data. The outputs of the mappers are combined in reducers, which produce the final output of the algorithm [4].

Tools: Open Source Revolution

Early model inference and machine learning programs from the 1980s were most often invoked from a command prompt. Researchers mostly used a scripting language such as Perl to separately implement sampling procedures and then execute the programs. To compare different algorithms, such scripts needed to reformat the data for each algorithm, parse the textual outputs of each model and use them to compute the corresponding performance scores. Implementations of this type required a great deal of text processing and programming that modern tools make unnecessary. Flexibility and extensibility in analysis software arise from being able to use existing code to develop or extend one's own algorithms.

A. R

R is a language and environment for statistical computing and graphics.
Most of its computationally intensive methods are implemented in C, C++ or Fortran and then interfaced with R, a scripting language [4]. R includes an extensive variety of techniques for statistical testing, predictive modeling and data visualization, and it has become a de facto standard open source library for statistics. The interface to R is the command line, and it is used through scripting. An extension implemented as an R library provides a graphical user interface to many of R's data analysis and modeling functions [5].

B. Weka

Weka is one of the best-known machine learning and data mining environments. Users can access its components through Java programming or through a command-line interface. Weka provides graphical user interfaces in applications called the Weka KnowledgeFlow environment, which features visual programming, and the Weka Explorer. Weka is much weaker than R in classical statistical testing but stronger in machine learning. There are two main ways to use Weka to conduct data mining tasks [4]:

- Use the Weka GUI (graphical user interface). The GUI is straightforward and easy to use, but it is not flexible: it cannot be called from your own application.
- Import the Weka Java library into your own Java application. Developers can build on the Weka Java library to develop software, or modify the source code to meet special requirements. This is more flexible and advanced, but not as easy as the GUI.

The overall goal of Weka is to build a state-of-the-art facility for developing machine learning techniques and to allow people to apply them to real-world data mining problems [8].

C. KNIME

KNIME is a nicely designed data mining tool that runs inside IBM's Eclipse development environment. It is written in Java and can extend its library of built-in supervised and unsupervised data mining algorithms with those provided by Weka. Each node performs a certain function, such as reading data, filtering, modeling or visualization. Nodes have input and output ports. Some nodes handle data models such as classification trees [4].

D. Orange

Orange is a data mining suite built on the same principle as KNIME and the Weka KnowledgeFlow. In its graphical environment, called Orange Canvas, the user places widgets on a canvas and connects them into a schema. Each widget performs some basic function, but Orange differs from KNIME, which works with two data types (models and sets of instances). KNIME has the benefit of giving the user more control in setting up the details of an experiment, such as separate preprocessing of training and testing example sets. Orange, in turn, offers a large number of different visualizations of data and models, including intelligent search for good visualizations. Orange is weak in classical statistics; it provides no widgets for statistical testing. The computationally intensive parts of Orange are written in C++, whereas the upper layers are developed in the scripting language Python [4].

E. MOA

MOA (Massive Online Analysis) is a framework for data stream mining. It includes tools for evaluation and a collection of machine learning algorithms. It has implementations of classification, regression, clustering, frequent pattern mining and frequent graph mining. Related to the WEKA project, it is also implemented in Java. It includes a collection of offline and online methods for classification and clustering, as well as tools for evaluation.
It is easy to design, extend and run experiments in MOA. The goal of the MOA framework is to support running experiments in a data stream mining context by providing:

- Storable settings for data streams, for repeatable experiments.
- A set of existing algorithms and measures from the literature for comparison.
- An easily extendable framework for new streams, algorithms and evaluation methods.

The workflow in MOA is as follows: first, a data stream is chosen and configured; second, an algorithm is chosen and its parameters are set; third, an evaluation method or measure is chosen; finally, the results are obtained after running the task. To run an experiment using MOA, the user can choose between GUI and command-line execution [7].
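The four workflow steps can be illustrated with a small, self-contained sketch. It is written in plain Python and does not use MOA or its API. The names `make_stream`, `NearestMeanLearner` and `run_test_then_train` are hypothetical, the synthetic stream and the toy learner are assumptions chosen only for illustration, and test-then-train evaluation is assumed as one common choice of evaluation method for streams.

```python
import random
import time
from collections import Counter


def make_stream(n_instances, seed=1):
    """Step 1: choose and configure a stream (seeded, so the run is repeatable)."""
    rng = random.Random(seed)
    for _ in range(n_instances):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.1:  # 10% label noise (illustrative assumption)
            label = 1 - label
        yield x, label


class NearestMeanLearner:
    """Step 2: a toy incremental learner (hypothetical, not a MOA classifier).

    It keeps a running mean of x per class and predicts the class whose
    mean is closest to the incoming value.
    """

    def __init__(self):
        self.count = Counter()
        self.total = Counter()

    def predict(self, x):
        if not self.count:
            return 0  # nothing seen yet: fall back to a default class
        means = {c: self.total[c] / self.count[c] for c in self.count}
        return min(means, key=lambda c: abs(x - means[c]))

    def learn(self, x, label):
        self.count[label] += 1
        self.total[label] += x


def run_test_then_train(stream, learner):
    """Step 3: evaluation; each instance is first used to test, then to train."""
    correct = seen = 0
    start = time.perf_counter()
    for x, label in stream:
        correct += int(learner.predict(x) == label)
        learner.learn(x, label)
        seen += 1
    elapsed = time.perf_counter() - start
    return {"instances": seen, "accuracy": correct / seen, "seconds": elapsed}


# Step 4: run the task and obtain the results.
result = run_test_then_train(make_stream(5000, seed=1), NearestMeanLearner())
print(result)
```

The sketch reports accuracy and time, two of the three dimensions named in the introduction. Measuring memory use is left out for brevity.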

Fig-1 Classification Experimental Setting
Fig-2 Configure Task

III. PROPOSED WORK

MOA contains several classifier methods, such as Naïve Bayes, Decision Stump, Hoeffding Tree, Hoeffding Option Tree, Bagging and Boosting.

Hoeffding Tree

A Hoeffding tree is an incremental, anytime decision tree induction algorithm that is capable of learning from massive data streams. It relies on the fact that a small sample can often be enough to choose an optimal splitting attribute, which is justified by the Hoeffding bound. The Hoeffding bound states that, with probability $1 - \delta$, the true mean of a random variable of range $R$ will not differ from the estimated mean after $n$ independent observations by more than

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

This bound is useful because it holds regardless of the distribution generating the values [10].

Hoeffding Option Tree

Hoeffding option trees are regular Hoeffding trees containing additional option nodes that allow several tests to be applied, leading to multiple Hoeffding trees as separate paths. They consist of a single structure that efficiently represents multiple trees [9].

IV. PROPOSED STRUCTURE

Above we discussed the problem of novel classes. When we classify data, some instances are classified correctly and some are misclassified. If the class of a misclassified instance has not been defined previously, we consider the instance to belong to a novel class rather than treat it as an outlier. This is a data stream mining problem, so we use the MOA tool, with the Hoeffding Option Tree as the classifier. In the proposed structure its voting method is changed, and we expect to obtain higher accuracy because the Hoeffding Option Tree is used.

V. CONCLUSION

State-of-the-art open source data mining tools provide GUIs and focus on usability, interactivity and extensibility. Here we discussed some open source tools for data mining and data stream mining. Our proposed work is on data stream mining, so the most suitable tool is MOA. MOA provides an experimental framework for classification and clustering. It also provides the Hoeffding Tree and Hoeffding Option Tree classifiers.
Using these, we can modify the method and improve the accuracy of novel class detection.

REFERENCES

[1] Mohammad M. Masud, Tahseen M. Al-Khateeb, Latifur Khan, Charu Aggarwal, Jing Gao, Jiawei Han and Bhavani Thuraisingham, "Detecting Recurring and Novel Classes in Concept-Drifting Data Streams," ICDM, IEEE 11th International Conference on Data Mining, pp. 1176-1181, 2011.
[2] S. Thanngamani, "Dynamic Feature Set Based Classification Scheme under Data Streams," International Journal of Communication and Engineering, Volume 04, No. 04, Issue 01, March-201.
[3] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han and Bhavani Thuraisingham, "Classification and Novel Class Detection in Data Streams with Active Mining," in M. J. Zaki et al. (Eds.): PAKDD 2010, Part II, LNAI 6119, pp. 311-324, Springer-Verlag Berlin Heidelberg, 2010.
[4] Albert Bifet, "Mining Big Data in Real Time," Informatica 37 (2013), pp. 15-20.
[5] Blaz Zupan and Janez Demsar, "Open-Source Tools for Data Mining," Clin Lab Med 28 (2008), pp. 37-54.
[6] Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Philipp Kranen, Hardy Kremer, Timm Jansen and Thomas Seidl, "MOA: Massive Online Analysis, a Framework for Stream Classification and Clustering."
[7] MOA, http://moa.cms.waikato.ac.nz.

[8] WEKA, http://www.cs.waikato.ac.nz/ml/weka.
[9] Geoffrey Holmes, Richard Kirkby and Bernhard Pfahringer, "Mining Data Streams Using Option Trees" (revised edition, 2004).
[10] Pedro Domingos and Geoff Hulten, "Mining High-Speed Data Streams," in Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 71-80, ACM, August 2000.
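As a supplementary worked example for Section III, the Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), can be evaluated numerically to see how quickly it tightens as more instances arrive. The sketch below is illustrative only. The function names are hypothetical, the values R = 1 and delta = 1e-7 are assumptions, and the split test (split once the observed gap between the two best attributes exceeds epsilon) follows our reading of the Hoeffding tree algorithm in [10], not MOA's implementation.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_gain, second_gain, value_range, delta, n):
    """True when the gap between the two best attributes exceeds epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Assumed illustrative settings: range R = 1 and delta = 1e-7.
for n in (100, 1000, 10000):
    eps = hoeffding_bound(1.0, 1e-7, n)
    # Hypothetical observed gains of 0.30 and 0.25 for the two best attributes.
    print(n, round(eps, 4), can_split(0.30, 0.25, 1.0, 1e-7, n))
```

Because epsilon shrinks in proportion to 1/sqrt(n), the same observed gap of 0.05 is not enough to justify a split after 100 instances but is enough after 10,000.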