MLPACK: A Scalable C++ Machine Learning Library
Ryan R. Curtin [email protected]
James R. Cline [email protected]
Neil P. Slagle [email protected]
Matthew L. Amidon [email protected]
Alexander G. Gray [email protected]
School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA

Abstract

MLPACK is a new, state-of-the-art, scalable C++ machine learning library, which will be released in early December 2011. Its aim is to make large-scale machine learning possible for novice users by means of a simple, consistent API, while simultaneously exploiting C++ language features to provide maximum performance and maximum flexibility for expert users. MLPACK provides many cutting-edge algorithms, including but not limited to tree-based k-nearest-neighbors, fast hierarchical clustering, MVU (maximum variance unfolding) with LRSDP, RADICAL (ICA), and HMMs. Each of these algorithms is highly flexible and configurable, and all of them can be configured to use sparse matrices for datasets. Benchmarks are shown to demonstrate that MLPACK scales more effectively than competing toolkits. Future plans for MLPACK include support for parallel algorithms, support for disk-based datasets, and the implementation of a large collection of cutting-edge algorithms. More information is available at the project's website.

1 Introduction and Goals

While there are many machine learning libraries available to the public for an assortment of machine learning tasks, very few libraries targeted at the average user focus on scalability. For instance, the popular WEKA toolkit [7] focuses on ease of use but does not scale well. On the other end of the continuum, the focus of the distributed Apache Mahout [12] library is scalability using larger, more overhead-intensive setups such as clusters and powerful servers, which the average user is unlikely to have.
In addition, many other libraries implement only a few methods; for instance, LIBSVM [3] and the Tilburg Memory-Based Learner (TiMBL) [5] are highly capable and scalable software, but each implements only one method, and a user must learn a new API for each. MLPACK, which is meant to be the machine learning analog to the general-purpose LAPACK linear algebra library, aims to bridge the gap between scalability and ease of use. The library is written in C++ using the Armadillo matrix library [10]. MLPACK uses template-based optimization techniques to eliminate unnecessary copying of datasets and to optimize mathematical expressions in ways the compiler's optimizer cannot. In addition, MLPACK exploits the generic programming features of C++, which allows a user to easily customize machine learning methods to their own purposes without incurring any performance penalty. To our knowledge, MLPACK is the only machine learning library to exploit these techniques to their full potential.
In addition, the consistent, intuitive interface of MLPACK minimizes the learning time necessary for either an undergraduate student or a machine learning expert. Unlike many other scientific software packages, complete and comprehensible documentation is a high priority for MLPACK, and each method that MLPACK implements provides detailed reference documentation. The development team of MLPACK strives to balance four overarching goals in the design of the library:

- Implement scalable, fast machine learning algorithms
- Design an intuitive, consistent, and simple API for users who are not C++ gurus
- Implement as large a collection of machine learning methods as possible
- Provide cutting-edge machine learning algorithms that no other library does

This paper is meant to serve as a quick introduction to the features MLPACK provides, its simple and extensible API, and the superior performance and scalability of the library. The ways in which each of the four design goals is achieved are analyzed.

2 Scalable, Fast Machine Learning Algorithms

The most important feature of MLPACK is the scalability of the machine learning algorithms it implements; this scalability is achieved mostly by the use of C++. C++ is not the most popular choice of language for machine learning implementations; instead, researchers generally choose simpler languages such as MATLAB, C, FORTRAN, and sometimes Java or Python. While each of those languages is modern and supports modern features, none is as complex as C++, a language which requires training and expertise to fully understand and utilize. Unlike most other languages used in machine learning, C++ supports generic programming via templates. Although Java also supports generics, the C++ template mechanism is far more extensible and useful.
In the late 1990s and early 2000s, shortly after the introduction of templates into C++, the field of template metaprogramming emerged, culminating, in part, with Alexandrescu's seminal Modern C++ Design [2]. The main appeal of template metaprogramming is the ability to move parts of computation that are regularly performed at runtime into compile time. However, this concept has not gained significant traction in the field of machine learning software, even ten years after the introduction of template metaprogramming. MLPACK's underlying matrix library, Armadillo, uses lazy evaluation to avoid both unnecessary copies and unnecessary operations [10]. To illustrate this, consider the following very simple expression (assuming x, y, z, and w are square matrices of the same size):

w = x + (y + z)

If implemented as written in MATLAB, this will first compute (y + z), store it in a temporary matrix, then add x to the temporary matrix and store the result in w. Clearly, for massive datasets, the creation of an unnecessary temporary matrix causes both memory and performance issues. The Armadillo library, relying on lazy evaluation using template metaprogramming, is able to avoid this entirely by optimizing operation performance at compile time. The vast majority of machine learning libraries do not support this type of compile-time optimization.

Another drawback of most commonly-used scientific languages is suboptimal support for user-defined metrics and kernels. For instance, Shogun [11] provides arbitrary distance metric support via inheritance and virtual functions, but the use of virtual functions or other dynamic bindings incurs a performance penalty for the lookup of the dynamic binding [8]. MLPACK instead uses static polymorphism (an extension of the Curiously Recurring Template Pattern [4]) as well as policy-based design [2]. The performance penalty of the dynamic binding is eliminated by resolving the binding entirely at compile time.
The use of policy-based design allows incredible flexibility. MLPACK supports entirely arbitrary, user-defined distance metrics or kernel functions. More importantly, through this paradigm MLPACK is able to support the use of both dense and sparse matrices anywhere in each of the algorithms it implements. No other machine learning library to date provides this functionality.

3 A Consistent, Simple API

The use of complex template techniques in C++ generally results in complex and unwieldy APIs, forcing a user to read through countless pages of documentation to understand how to use the code. However, because a goal of MLPACK is to provide an intuitive and simple API, these issues can be avoided. This is possible, in part, by providing default parameters which can be left unspecified. For instance, the Neighborhood Components Analysis metric learning method can be created very easily (assuming data is an Armadillo matrix containing the dataset):

NCA nca(data);

However, an expert user could run NCA in a non-Euclidean metric space, using a custom optimizer, on sparse double-precision floating-point input data:

NCA<ACustomMetric, ACustomOptimizer, arma::SpMat<double> > nca(data);

Each MLPACK method is designed similarly, and importantly, each MLPACK method has a consistent API. A user can move from one method to another, expecting to interact with the new method in the same way. This consistent API is a main reason that MLPACK's API can be called intuitive. In addition, stringent documentation is a key requirement for an algorithm to be accepted into MLPACK. This comprehensive documentation allows a novice user to simply run the basic algorithm and an expert user to customize the algorithm to their specific needs.

4 A Large Collection of Machine Learning Methods

MLPACK implements commonly used machine learning algorithms as well as algorithms for which no other implementation is available. It provides a set of core routines, including several optimizers, such as L-BFGS and the Nelder-Mead method.
MLPACK also contains a strong set of tree-building and tree-query routines, making MLPACK's implementations of neighbor-search problems and general n-body problems [6] very powerful. As mentioned earlier, MLPACK even supports sparse matrices for any of its methods through the policy-based design C++ paradigm [2]; this capability is novel in the machine learning software community. Each machine learning method provides a well-documented command-line interface for any researcher who does not wish to extend MLPACK's code but instead to use the provided methods. By using the library in this manner, a user of MLPACK does not even need to know C++ or any programming language. On the other hand, for an expert researcher, MLPACK allows arbitrary distance metrics and kernel functions for all of its methods, among its many customizable features. Below is a list of the machine learning methods which will be available, along with some key features for each of them, at the time of the library's release in December 2011:

- Fast Hierarchical Clustering (Euclidean Minimum Spanning Trees)
- Gaussian Mixture Models (trained via EM)
- Hidden Markov Models (training, prediction, and classification)
- Kernel Principal Components Analysis
- K-Means clustering
- LARS/Lasso Regression
- Least-squares Linear Regression
- Maximum Variance Unfolding (using LRSDP)
- Naive Bayes Classifier
- Neighborhood Components Analysis (using LRSDP)
- RADICAL (Robust, Accurate, Direct ICA algorithm)
- Tree-based k-nearest-neighbors search and classifier
- Tree-based range search

In addition, many methods are currently in development and will be released in the future.

5 Benchmarks

To demonstrate the efficiency of the algorithms implemented in MLPACK, the running time of k-nearest-neighbors is compared with other libraries. Only one algorithm is tested for the sake of conciseness, but when the library is released, full benchmarks for each algorithm will be available. The benchmarks shown here were run on a modest consumer-grade workstation containing an AMD Phenom II X6 1100T processor clocked at 3.3 GHz and 8 GB of RAM. The libraries being compared are MLPACK, WEKA [7], the MATLAB knnsearch() routine, the Shogun Toolkit [11], the Python package mlpy [1], and the Python package scikit-learn ("sklearn") [9]. Several randomly generated, uniformly distributed datasets of varying sizes were used for this benchmark. The computation time of k-NN with k = 5 for each library and each dataset size is given in Table 1; the listed time includes the time taken to load the datasets (CSV) and save the results.

[Table 1: k-NN benchmarks. Columns: dataset size, then running times for MLPACK, WEKA, Shogun, MATLAB, mlpy, and sklearn; rows cover dataset sizes from 1000 points upward, with several competing libraries failing or exceeding the time limit on the larger datasets.]

MLPACK is faster than every competitor for every case tested here. It is clear that MLPACK scales more gracefully and effectively than any of its competitors for k-nearest-neighbors. Benchmarks for other methods (which cannot be shown here due to space constraints) give similar results.

6 Future Plans

In spite of the favorable benchmarks shown here, these are by no means the best benchmarks MLPACK will ever achieve.
MLPACK has an active development team of seven core developers, as well as several contributors. Because MLPACK is an open-source project, contributions from any outside source are welcome, as are feature requests and bug reports. This means that the performance of MLPACK's algorithms, their extensibility, and the breadth of algorithms MLPACK implements are all certain to improve. Parallelism could not be implemented elegantly in time for the first release of MLPACK, but experimental parallel MLPACK code is currently in testing. Work is ongoing to determine how to implement parallelism while both maintaining a simple API and not incurring large, backwards-incompatible API changes. Parallel algorithms are expected to be available in the next major release of MLPACK. Another useful planned feature is the ability to use on-disk databases, instead of requiring the dataset to be loaded entirely into RAM. Many additional algorithms are either planned or in development for MLPACK and will be released in the future. These algorithms include dimensionality reduction techniques such as linear discriminant analysis (LDA) and locally linear embedding (LLE), source separation methods such as non-negative matrix factorization (NMF) and information maximization (Infomax) ICA, kernel density estimation methods, and a variety of other algorithms.

7 Conclusion

We have shown that MLPACK is a state-of-the-art scalable C++ machine learning library which leverages the powerful C++ concept of generic programming to give excellent performance on large datasets. The four goals to which MLPACK development adheres require that, in addition to being very fast, the code itself be easy for a novice user to understand and use. MLPACK supports a wide range of machine learning methods, each of which is highly configurable. For instance, a user can write an arbitrary distance metric or kernel function and use it with MLPACK methods. A user can also choose to use sparse or dense matrices for their datasets at will, flexibility that other machine learning libraries do not offer. In the future, MLPACK aims to support parallelism while maintaining a simple API, as well as publishing new state-of-the-art machine learning algorithms regularly. Effective interaction with the machine learning community through the channels of MLPACK's open-source development model enables the library to be shaped to the needs of the community, allowing MLPACK to be an effective tool in advancing the field of machine learning.

Acknowledgements

A full list of developers and researchers other than the authors who have contributed significantly to MLPACK: Sterling Peet, Vlad Grantcharov, Ajinkya Kale, Bill March, Dongryeol Lee, Nishant Mehta, Parikshit Ram, Chip Mappus, Hua Ouyang, Long Quoc Tran, and Noah Kauffman.

References

[1] Davide Albanese, Giuseppe Jurman, Roberto Visintainer, and Stefano Merler. mlpy: Machine Learning Python.
[2] Andrei Alexandrescu. Modern C++ Design: Generic Programming and Design Patterns Applied. Addison-Wesley Professional, Indianapolis, 2001.
[3] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[4] James Coplien. C++ Gems. Cambridge University Press & SIGS Books, 1996.
[5] Walter Daelemans, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch. TiMBL: Tilburg Memory Based Learner, version 4.0, Reference Guide. Technical Report 01-04, ILK, 2001.
[6] Alexander G. Gray and Andrew W. Moore. N-body problems in statistical learning. In Advances in Neural Information Processing Systems 14.
[7] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1), 2009.
[8] ISO/IEC. Technical Report on C++ Performance. Technical report, February 2006.
[9] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[10] Conrad Sanderson. Armadillo: An open source C++ linear algebra library for fast prototyping and computationally intensive experiments. Technical report, NICTA, 2010.
[11] Soeren Sonnenburg, Gunnar Raetsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, and Vojtech Franc. The SHOGUN machine learning toolbox. Journal of Machine Learning Research, 11:1799-1802, June 2010.
[12] The Apache Foundation. Apache Mahout.
Empowering the Masses with Analytics
Empowering the Masses with Analytics THE GAP FOR BUSINESS USERS For a discussion of bridging the gap from the perspective of a business user, read Three Ways to Use Data Science. Ask the average business
Machine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories
MLlib: Scalable Machine Learning on Spark
MLlib: Scalable Machine Learning on Spark Xiangrui Meng Collaborators: Ameet Talwalkar, Evan Sparks, Virginia Smith, Xinghao Pan, Shivaram Venkataraman, Matei Zaharia, Rean Griffith, John Duchi, Joseph
Classifying Manipulation Primitives from Visual Data
Classifying Manipulation Primitives from Visual Data Sandy Huang and Dylan Hadfield-Menell Abstract One approach to learning from demonstrations in robotics is to make use of a classifier to predict if
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
How To Understand How Weka Works
More Data Mining with Weka Class 1 Lesson 1 Introduction Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz More Data Mining with Weka a practical course
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
Visualisation in the Google Cloud
Visualisation in the Google Cloud by Kieran Barker, 1 School of Computing, Faculty of Engineering ABSTRACT Providing software as a service is an emerging trend in the computing world. This paper explores
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Bayesian networks - Time-series models - Apache Spark & Scala
Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
MSCA 31000 Introduction to Statistical Concepts
MSCA 31000 Introduction to Statistical Concepts This course provides general exposure to basic statistical concepts that are necessary for students to understand the content presented in more advanced
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
Rackspace Cloud Databases and Container-based Virtualization
Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many
Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors
Classification k-nearest neighbors Data Mining Dr. Engin YILDIZTEPE Reference Books Han, J., Kamber, M., Pei, J., (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann
Part I Courses Syllabus
Part I Courses Syllabus This document provides detailed information about the basic courses of the MHPC first part activities. The list of courses is the following 1.1 Scientific Programming Environment
Spark: Cluster Computing with Working Sets
Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs
Is a Data Scientist the New Quant? Stuart Kozola MathWorks
Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: ([email protected]) TAs: Pierre-Luc Bacon ([email protected]) Ryan Lowe ([email protected])
DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2
DATA SCIENCE CURRICULUM Before class even begins, students start an at-home pre-work phase. When they convene in class, students spend the first eight weeks doing iterative, project-centered skill acquisition.
Spark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
A Comparative Study of clustering algorithms Using weka tools
A Comparative Study of clustering algorithms Using weka tools Bharat Chaudhari 1, Manan Parikh 2 1,2 MECSE, KITRC KALOL ABSTRACT Data clustering is a process of putting similar data into groups. A clustering
Introduction to Machine Learning Using Python. Vikram Kamath
Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Crowdfunding Support Tools: Predicting Success & Failure
Crowdfunding Support Tools: Predicting Success & Failure Michael D. Greenberg Bryan Pardo [email protected] [email protected] Karthic Hariharan [email protected] tern.edu Elizabeth
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
Scalable Developments for Big Data Analytics in Remote Sensing
Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND Vladimir Belsky Director of Solver Development* Luis Crivelli Director of Solver Development* Matt Dunbar Chief Architect* Mikhail Belyi Development Group Manager*
