Orange: Data Mining Toolbox in Python
Journal of Machine Learning Research 14 (2013)
Submitted 3/13; Published 8/13

Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik, Blaž Zupan
Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia

Abstract

Orange is a machine learning and data mining suite for data analysis through Python scripting and visual programming. Here we report on the scripting part, which features interactive data analysis and component-based assembly of data mining procedures. In the selection and design of components, we focus on the flexibility of their reuse: our principal intention is to let the user write simple and clear scripts in Python, which build upon C++ implementations of computationally intensive tasks. Orange is intended both for experienced users and programmers, as well as for students of data mining.

Keywords: Python, data mining, machine learning, toolbox, scripting

1. Introduction

Scripting languages have recently risen in popularity in all fields of computer science. Within the context of explorative data analysis, they offer advantages like interactivity and fast prototyping by gluing together existing components or adapting them for new tasks. Python is a scripting language with clear and simple syntax, which has also made it popular in education. Its relatively slow execution can be circumvented by using libraries that implement the computationally intensive tasks in low-level languages.

Python offers a huge number of extension libraries. Many are related to machine learning, including several general packages like scikit-learn (Pedregosa et al., 2011), PyBrain (Schaul et al., 2010) and mlpy (Albanese et al., 2012). Orange was conceived in the late 1990s and is among the oldest of such tools.
It focuses on simplicity, interactivity through scripting, and component-based design.

© 2013 Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik and Blaž Zupan
2. Toolbox Overview

The Orange library is a hierarchically organized toolbox of data mining components. The low-level procedures at the bottom of the hierarchy, like data filtering, probability assessment and feature scoring, are assembled into higher-level algorithms, such as classification tree learning. This allows developers to easily add new functionality at any level and fuse it with the existing code. The main branches of the component hierarchy are:

- data management and preprocessing for data input and output, data filtering and sampling, imputation, feature manipulation (discretization, continuization, normalization, scaling and scoring), and feature selection,
- classification with implementations of various supervised machine learning algorithms (trees, forests, instance-based and Bayesian approaches, rule induction), borrowing from some well-known external libraries such as LIBSVM (Chang and Lin, 2011),
- regression including linear and lasso regression, partial least squares regression, regression trees and forests, and multivariate regression splines,
- association for association rules and frequent itemsets mining,
- ensembles implemented as wrappers for bagging, boosting, forest trees, and stacking,
- clustering, which includes k-means and hierarchical clustering approaches,
- evaluation with cross-validation and other sampling-based procedures, functions for scoring the quality of prediction methods, and procedures for reliability estimation,
- projections with implementations of principal component analysis, multi-dimensional scaling and self-organizing maps.

The library is designed to simplify the assembly of data analysis workflows and the crafting of data mining approaches from a combination of existing components.
Besides a broader range of features, Orange differs from most other Python-based machine learning libraries in its maturity (over 15 years of active development and use), a large user community supported through an active forum, and extensive documentation that includes tutorials, scripting examples, a data set repository, and documentation for developers. The Orange scripting library is also the foundation for its visual programming platform, which provides graphical user interface components for interactive data visualization.

The two major packages that are similar to Orange and are still actively developed are scikit-learn (Pedregosa et al., 2011) and mlpy (Albanese et al., 2012). Both are more tightly integrated with numpy and at present blend better into Python's numerical computing habitat. Orange, on the other hand, was inspired by classical machine learning that focuses on symbolic methods. Rather than supporting only numerical arrays, Orange data structures combine symbolic, string and numerical attributes with meta data information. Users can, for instance, refer to variables and values by their names. Variables store mapping functions, a mechanism which, for instance, allows classifiers to define transformations on training data that are then automatically applied when making predictions. These features also make Orange more suitable for interactive, explorative data analysis.
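The idea of variables that store mapping functions can be illustrated with a small, self-contained sketch. It does not use Orange and does not reproduce its API: the names `Variable`, `compute_value`, `make_discretized` and `ThresholdLearner` are hypothetical and chosen only for illustration. The sketch shows a toy learner that derives a symbolic variable from a numeric one during training and keeps it inside the returned model, so that raw, untransformed rows can be passed in at prediction time.

```python
from collections import Counter, defaultdict


class Variable:
    """Hypothetical named variable; not Orange's actual class."""

    def __init__(self, name, compute_value=None):
        self.name = name
        # Optional mapping function: raw row (a dict) -> value of this variable.
        self.compute_value = compute_value

    def value_for(self, row):
        # Derived variables compute their value; plain ones look it up by name.
        if self.compute_value is not None:
            return self.compute_value(row)
        return row[self.name]


def make_discretized(source_name, threshold):
    """Build a symbolic variable derived from a numeric one by thresholding."""
    def mapping(row):
        return "high" if row[source_name] >= threshold else "low"
    return Variable("D_" + source_name, compute_value=mapping)


class ThresholdLearner:
    """Toy learner: discretizes one numeric variable at its training mean
    and predicts the majority class observed in each bin."""

    def __init__(self, source_name):
        self.source_name = source_name

    def __call__(self, rows, class_name):
        mean = sum(r[self.source_name] for r in rows) / len(rows)
        binned = make_discretized(self.source_name, mean)
        counts = defaultdict(Counter)
        for r in rows:
            counts[binned.value_for(r)][r[class_name]] += 1
        majority = {b: c.most_common(1)[0][0] for b, c in counts.items()}
        fallback = Counter(r[class_name] for r in rows).most_common(1)[0][0]
        # The model closes over `binned`, so the transformation learned on the
        # training data is applied automatically to every new raw row.
        return lambda row: majority.get(binned.value_for(row), fallback)


train = [
    {"age": 4, "survived": "yes"},
    {"age": 8, "survived": "yes"},
    {"age": 40, "survived": "no"},
    {"age": 52, "survived": "no"},
]
model = ThresholdLearner("age")(train, "survived")
print(model({"age": 30}))  # no
print(model({"age": 10}))  # yes
```

The caller never has to repeat the discretization step: the threshold (here the training mean, 26) travels with the model, which is the convenience the mapping-function mechanism is meant to provide.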
3. Scripting Examples

Let us illustrate the utility of Orange through an example of data analysis in the Python shell:

    >>> import Orange
    >>> data = Orange.data.Table("titanic")
    >>> len(data)
    2201
    >>> nbc = Orange.classification.bayes.NaiveLearner()
    >>> svm = Orange.classification.svm.SVMLearner()
    >>> stack = Orange.ensemble.stacking.StackedClassificationLearner([nbc, svm])
    >>> res = Orange.evaluation.testing.cross_validation([nbc, svm, stack], data)
    [ , , ]

We first read the data on the survival of 2,201 passengers of the RMS Titanic and construct a set of learning algorithms: a naive Bayesian learner, an SVM learner, and a stacked combination of the two (Wolpert, 1992). We then cross-validate the learners and report the area under the ROC curve (AUC) for each. Running stacking on the subset of 470 female passengers improves the AUC score:

    >>> females = Orange.data.Table([d for d in data if d["sex"] == "female"])
    >>> len(females)
    470
    >>> res = Orange.evaluation.testing.cross_validation([stack], females)
    [ ]

We can use existing machine learning components to craft new ones. For instance, learning algorithms must implement a call operator that accepts the training data and, optionally, data instance weights, and returns a model. The following example defines a new learner that encloses another learner in a feature selection wrapper: it sorts the features by their information gain (as implemented in Orange.feature.scoring.InfoGain), constructs a new data set with only the m best features and calls the base learner.
    class FSSLearner(Orange.classification.PyLearner):
        def __init__(self, base_learner, m=5):
            self.m = m
            self.base_learner = base_learner

        def __call__(self, data, weights=None):
            # Rank features by information gain and keep the m best.
            gain = Orange.feature.scoring.InfoGain()
            best = sorted(data.domain.features, key=lambda x: -gain(x, data))[:self.m]
            # Build a reduced data set and train the base learner on it.
            domain = Orange.data.Domain(best + [data.domain.class_var])
            new_data = Orange.data.Table(domain, data)
            model = self.base_learner(new_data, weights)
            return Orange.classification.PyClassifier(classifier=model)

Below we compare the original and the wrapped naive Bayesian classifier on a data set with 106 instances and 57 features:

    >>> data = Orange.data.Table("promoters")
    >>> len(data), len(data.domain.features)
    (106, 57)
    >>> bayes = Orange.classification.bayes.NaiveLearner()
    >>> res = Orange.evaluation.testing.cross_validation([bayes, FSSLearner(bayes)], data)
    [ , 0.945]

4. Code Design

Orange's core is a collection of nearly 200 C++ classes that cover the basic data structures and the majority of preprocessing and modeling algorithms. The C++ part is self-contained, without any calls to Python that would induce unnecessary overhead. The core includes several open source libraries, including LIBSVM (Chang and Lin, 2011), LIBLINEAR (Fan et al., 2008), Earth, QHull (Barber et al., 1996) and a subset of BLAS (Blackford et al., 2002). The Python layer also uses popular Python libraries: numpy for linear algebra, networkx (Hagberg et al., 2008) for working with networks and matplotlib (Hunter, 2007) for basic visualization.

The upper layer of Orange is written in Python and includes procedures that are not time-critical. This is also the place at which users outside the core development group can most easily contribute to the project.

Automated testing of the system relies on over 1,500 regression tests that are mostly based on code snippets from the extensive documentation. A part of the code is also covered by stricter unit tests.

5. Availability, Requirements and Plans for the Future

Orange is free software released under the GPL. The code is hosted in a Bitbucket repository (bitbucket.org/biolab/orange). Orange runs on Windows, Mac OS X and Linux, and can also be installed from the Python Package Index repository (pip install Orange). A binary installer for Windows and an application bundle for Mac OS X are available on the project's web site (http://orange.biolab.si).

Orange currently runs on Python 2.6 and 2.7. A version for Python 3 and higher is under development.
There, we will switch to numpy-based data structures and scrap the C++ core in favor of using routines from numpy and scipy (Jones et al., 2001), scikit-learn (Pedregosa et al., 2011) and similar libraries that did not exist when Orange was first conceived. Despite the planned changes in the core, we will maintain backward compatibility. For existing users, the changes to the Python interface will be minor.

Acknowledgments

We would like to acknowledge support for this project from the Slovenian Research Agency (P2-0209, J2-9699, L2-1112), the National Institutes of Health (P01-HD39691), and Astra Zeneca. We thank the anonymous reviewers for their constructive comments.
References

D. Albanese, R. Visintainer, S. Merler, S. Riccadonna, G. Jurman, and C. Furlanello. mlpy: Machine learning Python. CoRR, abs/, 2012.

C. B. Barber, D. P. Dobkin, and H. T. Huhdanpaa. The Quickhull algorithm for convex hulls. ACM Trans. on Mathematical Software, 22(4), 1996.

L. S. Blackford, A. Petitet, R. Pozo, K. Remington, R. C. Whaley, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, and G. Henry. An updated set of basic linear algebra subprograms (BLAS). ACM Transactions on Mathematical Software, 28(2), 2002.

C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 2008.

A. A. Hagberg, D. A. Schult, and P. J. Swart. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy2008), pages 11-15, Pasadena, CA USA, 2008.

J. D. Hunter. Matplotlib: A 2D graphics environment. Computing In Science & Engineering, 9(3):90-95, 2007.

E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source scientific tools for Python, 2001.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, and V. Dubourg. scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2011.

T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber. PyBrain. Journal of Machine Learning Research, 11, 2010.

D. H. Wolpert. Stacked generalization. Neural Networks, 5(2), 1992.
Data Analytics at NERSC. Joaquin Correa [email protected] NERSC Data and Analytics Services
Data Analytics at NERSC Joaquin Correa [email protected] NERSC Data and Analytics Services NERSC User Meeting August, 2015 Data analytics at NERSC Science Applications Climate, Cosmology, Kbase, Materials,
Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence 2014 http://www.iet.unipi.it/p.ducange/esercitazionibi/
Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence 2014 http://www.iet.unipi.it/p.ducange/esercitazionibi/ Email: [email protected] Office: Dipartimento di Ingegneria
Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm
Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Martin Hlosta, Rostislav Stríž, Jan Kupčík, Jaroslav Zendulka, and Tomáš Hruška A. Imbalanced Data Classification
Comparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
Is a Data Scientist the New Quant? Stuart Kozola MathWorks
Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by
Keywords data mining, prediction techniques, decision making.
Volume 5, Issue 4, April 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analysis of Datamining
Scientific Programming, Analysis, and Visualization with Python. Mteor 227 Fall 2015
Scientific Programming, Analysis, and Visualization with Python Mteor 227 Fall 2015 Python The Big Picture Interpreted General purpose, high-level Dynamically type Multi-paradigm Object-oriented Functional
CSci 538 Articial Intelligence (Machine Learning and Data Analysis)
CSci 538 Articial Intelligence (Machine Learning and Data Analysis) Course Syllabus Fall 2015 Instructor Derek Harter, Ph.D., Associate Professor Department of Computer Science Texas A&M University - Commerce
REVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1
Data Mining 1 Introduction 2 Data Mining methods Alfred Holl Data Mining 1 1 Introduction 1.1 Motivation 1.2 Goals and problems 1.3 Definitions 1.4 Roots 1.5 Data Mining process 1.6 Epistemological constraints
Data Warehousing and Data Mining in Business Applications
133 Data Warehousing and Data Mining in Business Applications Eesha Goel CSE Deptt. GZS-PTU Campus, Bathinda. Abstract Information technology is now required in all aspect of our lives that helps in business
Index Contents Page No. Introduction . Data Mining & Knowledge Discovery
Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.
Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com
SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING
