DATA MINING ALPHA MINER

AlphaMiner was developed by the E-Business Technology Institute (ETI) of the University of Hong Kong with support from the Innovation and Technology Fund (ITF) of the Government of the Hong Kong Special Administrative Region (HKSAR). It is an open source data mining platform that offers versatile model building and data cleansing features through a user-friendly workflow interface. Workflow-style case construction lets general business managers build a data mining case with simple drag-and-drop operations. A pluggable component architecture provides extensibility for adding new BI capabilities in data import and export, data transformation, modeling algorithms, model assessment, and deployment. Data mining capabilities from Xelopes and Weka were incorporated in the first release. Its data mining functions support industry-specific analyses, including customer profiling and clustering, product association analysis, classification, and prediction.

CLEMENTINE

The Clementine data mining toolkit was originally developed by Integral Solutions Limited. The company was later acquired by SPSS Inc. in 1999. SPSS (Statistical Package for the Social Sciences) is a software package for comprehensive data mining (not its initial objective) and analytic applications for enhanced decision making. The strength of SPSS lies in statistical analysis: it contains a systematic series of statistical functions, from descriptive analysis through parametric and nonparametric tests to nonlinear regression. Clementine is regarded as a supplement to SPSS, providing many intelligent modeling functions that go beyond the traditional statistical techniques; C5.0 is one such example. Clementine and SPSS run independently. However, to enhance Clementine's specialty without losing generality in statistical analysis, Clementine not only embeds most SPSS functions in its interface but also provides a facility to export its process to SPSS. As a data mining tool, Clementine follows the basic preprocessing, modeling, and postprocessing routine to reveal the information and knowledge behind the data.

IBM INTELLIGENT MINER

Intelligent Miner (IM) is based on a client-server architecture. The server can run on OS/390, OS/400, AIX, Sun/Solaris, or Windows NT, and the client can be installed on AIX, OS/2, Windows NT, or Windows 95. IM can handle large quantities of data, shield users from the inner workings of the underlying mining technology, present results in an easy-to-understand fashion, and provide programming interfaces. Increasing numbers of mining applications that deploy mining results are being developed by customers, IBM, and IBM partners. Through an intuitive graphical user interface (GUI) you can visually design data mining operations, choosing tools and customizing them to meet your requirements. The available tools cover the whole spectrum of data mining functions. In addition, IM selects data, explores it, transforms it, and visually interprets the results for productive and efficient knowledge discovery. The data analyst handles the development, and the business analyst handles the application work. The server runs the mining and processing functions and stores the historical data and the mining results. The client manipulates the data with the visualization tools and can be used to visually build a data mining operation, run it on the server, and have the results returned for visualization and further analysis. In addition, the IM application programming interface (API) provides C++ classes and methods as well as C structures and functions for application programmers.

KNIME

KNIME (Konstanz Information Miner) is a user-friendly, intelligible, and comprehensive open-source data integration, processing, analysis, and exploration platform.
It gives users the ability to visually create data flows or pipelines, selectively execute some or all analysis steps, and later study the results, models, and interactive views. KNIME is written in Java and is based on Eclipse; it uses the Eclipse extension mechanism to support plugins that provide additional functionality. Through plugins, users can add modules for text, image, and time series processing and integrate various other open source projects, such as the R programming language, Weka, the Chemistry Development Kit, and LibSVM.

KXEN

KXEN is an American software company based in San Francisco, California. The company primarily produces predictive analytics software. KXEN provides a complete data mining environment that includes data access, data manipulation, data aggregation, text extraction, data encoding, model training, reporting, model deployment, scoring code export, and model maintenance. Its Modeling Assistant user interface gives control over all the processes necessary to create and deploy understandable and powerful predictive models. InfiniteInsight is a predictive modeling suite developed by KXEN that helps analytics professionals and business executives extract information from data. It has been designed to allow the prediction of a behavior or a value, the forecast of a time series, or the understanding of a group of individuals with similar behavior.

ORACLE DATA MINING

Oracle Data Mining (ODM) embeds data mining within the Oracle database. ODM algorithms operate natively on relational tables or views, eliminating the need to extract and transfer data into standalone tools or specialized analytic servers. ODM's integrated architecture results in a simpler, more reliable, and more efficient data management and analysis environment. Data mining tasks can run asynchronously and independently of any specific user interface as part of standard database processing pipelines and applications. Data analysts can mine the data in the database, build models and methodologies, and then turn those results and methodologies into full-fledged application components ready to be deployed in production environments.
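The idea of scoring data where it lives, rather than extracting it into a separate tool, can be illustrated with a small sketch. This is not the ODM API: it uses Python's built-in sqlite3 module as a stand-in for a relational database, and the table names (customers, model_coef), the view name (customer_scores), and the linear model weights are all hypothetical, chosen only for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding data and a (hypothetical) model."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, age REAL, income REAL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, 25, 30000), (2, 40, 82000), (3, 58, 51000)],
    )
    # The model is stored as an ordinary table next to the data it scores.
    conn.execute("CREATE TABLE model_coef (feature TEXT PRIMARY KEY, weight REAL)")
    conn.executemany(
        "INSERT INTO model_coef VALUES (?, ?)",
        [("intercept", -2.0), ("age", 0.02), ("income", 0.00002)],
    )
    return conn


def score_in_database(conn):
    """Define scoring as a view so the computation runs inside the database."""
    conn.execute(
        """
        CREATE VIEW customer_scores AS
        SELECT c.id,
               (SELECT weight FROM model_coef WHERE feature = 'intercept')
             + c.age    * (SELECT weight FROM model_coef WHERE feature = 'age')
             + c.income * (SELECT weight FROM model_coef WHERE feature = 'income')
               AS score
        FROM customers AS c
        """
    )
    # Only the final (id, score) pairs leave the database.
    return conn.execute(
        "SELECT id, score FROM customer_scores ORDER BY id"
    ).fetchall()


if __name__ == "__main__":
    for customer_id, score in score_in_database(build_demo_db()):
        print(customer_id, round(score, 2))
```

Because the scores are exposed as a view over the source table, any application that can query the database can consume them without a separate export step, which is the deployment benefit this section attributes to in-database mining.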
The benefits of integration with the database are hard to overstate when it comes to deploying models and scoring data in a production environment. ODM allows a user to take advantage of all aspects of Oracle's technology stack as part of an application. Fewer "moving parts" also result in a simpler, more reliable, more powerful advanced business intelligence application. ODM provides single-user multi-session access to models. ODM programs can run either asynchronously or synchronously in the Java interface. ODM programs using the PL/SQL interface run synchronously; running PL/SQL asynchronously requires the Oracle Scheduler. For a brief description of the ODM interfaces, see "Java and PL/SQL Interfaces".

ORANGE

Orange is a component-based data mining and machine learning software suite, featuring a friendly yet powerful and flexible visual programming front end for exploratory data analysis and visualization, along with Python bindings and libraries for scripting. It includes a comprehensive set of components for data preprocessing, feature scoring and filtering, modeling, model evaluation, and exploration techniques. It is implemented in C++ (for speed) and Python (for flexibility). Its graphical user interface builds on the cross-platform Qt framework. Orange is distributed free under the GPL. It is maintained and developed at the Bioinformatics Laboratory of the Faculty of Computer and Information Science, University of Ljubljana, Slovenia.

RAPIDMINER

RapidMiner, formerly YALE (Yet Another Learning Environment), is an environment for machine learning, data mining, text mining, predictive analytics, and business analytics. It is used for research, education, training, rapid prototyping, application development, and industrial applications. In a poll by KDnuggets, a data-mining newspaper, RapidMiner ranked second among data mining/analytic tools used for real projects in 2009 and first in 2010. It is distributed under the AGPL open source license and has been hosted by SourceForge since 2004.

RapidMiner provides data mining and machine learning procedures including data loading and transformation (ETL), data preprocessing and visualization, modelling, evaluation, and deployment. Data mining processes are made up of arbitrarily nestable operators, described in XML files and created in RapidMiner's graphical user interface (GUI). RapidMiner is written in the Java programming language. It also integrates learning schemes and attribute evaluators of the Weka machine learning environment and statistical modelling schemes of the R-Project. The Community Edition of RapidMiner is a toolkit for data mining. It can define analytical steps (similar to R) and generate graphs (similar to MS Excel). It is also used for analyzing data generated by high-throughput instruments used in processes such as genotyping, proteomics, and mass spectrometry. RapidMiner can be used for text mining, multimedia mining, feature engineering, data stream mining and tracking drifting concepts, development of ensemble methods, and distributed data mining. RapidMiner was rated the fifth most used text mining software (6%) by Rexer's Annual Data Miner Survey in 2010. RapidMiner is used in many fields, including the electronics industry, energy industry, automobile industry, commerce, aviation, telecommunications, banking and insurance, production, IT industry, market research, and pharmaceutical industry.

SPSS

SPSS is a computer program used for survey authoring and deployment (IBM SPSS Data Collection), data mining (IBM SPSS Modeler), text analytics, statistical analysis, and collaboration and deployment (batch and automated scoring services). SPSS (originally, Statistical Package for the Social Sciences) was released in its first version in 1968 after being developed by Norman H. Nie and C. Hadlai Hull. SPSS is among the most widely used programs for statistical analysis in social science.
It is used by market researchers, health researchers, survey companies, government, education researchers, marketing organizations, and others. The original SPSS manual (Nie, Bent & Hull, 1970) has been described as one of "sociology's most influential books". In addition to statistical analysis, data management (case selection, file reshaping, creating derived data) and data documentation (a metadata dictionary is stored in the data file) are features of the base software. SPSS can read and write data in ASCII text files (including hierarchical files), other statistics packages, spreadsheets, and databases. SPSS can read and write to external relational database tables via ODBC and SQL. Statistical output is written to a proprietary file format (*.spv file, supporting pivot tables) for which, in addition to the in-package viewer, a stand-alone reader can be downloaded. The proprietary output can be exported to text, Microsoft Word, PDF, Excel, and other formats. Alternatively, output can be captured as data (using the OMS command), as text, tab-delimited text, PDF, XLS, HTML, XML, SPSS dataset, or a variety of graphic image formats (JPEG, PNG, BMP, and EMF). SPSS Server is a version of SPSS with a client/server architecture. It had some features not available in the desktop version, such as scoring functions.

TANAGRA

Tanagra (http://eric.univ-lyon2.fr/wricco/tanagra/) is a data mining suite built around a graphical user interface in which data processing and analysis components are organized in a tree-like structure, with each parent component passing the data to its children (Fig. 2). For example, to score a prediction model in Tanagra, the model is used to augment the data table with a column encoding the predictions, and the augmented table is then passed to the component for evaluation. Although it lacks more advanced visualizations, Tanagra is particularly strong in statistics, offering a wide range of uni- and multivariate parametric and nonparametric tests. Equally impressive is its list of feature selection techniques.
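The tree-structured data passing described above can be sketched in a few lines. This is an illustrative model only, not Tanagra's implementation: the Component class, the toy threshold model, and the column names (income, label, prediction) are all assumptions made for the example.

```python
class Component:
    """A node in a processing tree: transform the input, then feed the children."""

    def __init__(self, name, transform):
        self.name = name
        self.transform = transform
        self.children = []
        self.output = None

    def add(self, child):
        self.children.append(child)
        return child

    def run(self, table):
        # Each component sees only what its parent produced.
        self.output = self.transform(table)
        for child in self.children:
            child.run(self.output)
        return self.output


def toy_model(row):
    # Hypothetical classifier: a single threshold on one attribute.
    return "yes" if row["income"] > 50 else "no"


def add_predictions(table):
    # Scoring augments the table with a new column holding the predictions.
    return [dict(row, prediction=toy_model(row)) for row in table]


def accuracy_report(table):
    # Evaluation reads the prediction column added by its parent.
    hits = sum(1 for row in table if row["prediction"] == row["label"])
    return {"n": len(table), "accuracy": hits / len(table)}


def build_tree():
    root = Component("dataset", lambda table: table)
    scorer = root.add(Component("score", add_predictions))
    evaluator = scorer.add(Component("evaluate", accuracy_report))
    return root, scorer, evaluator


if __name__ == "__main__":
    data = [
        {"income": 30, "label": "no"},
        {"income": 60, "label": "yes"},
        {"income": 80, "label": "yes"},
        {"income": 55, "label": "no"},
    ]
    root, scorer, evaluator = build_tree()
    root.run(data)
    print(evaluator.output)
```

Running the root component propagates the table down the tree, so the evaluation component never needs to know how the predictions were produced, only that the column exists.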
Together with a compilation of standard machine learning techniques, it also includes correspondence analysis, principal component analysis, and partial least squares methods. The presentation of machine learning models is most often not graphical but instead, unlike in other machine learning suites, includes several statistical measures. The difference in approaches is best illustrated by the naive Bayesian classifier: unlike Weka and Orange, Tanagra reports the conditional probabilities and various statistical assessments of the importance of the attributes (e.g., χ², Cramér's V, and Tschuprow's T). Tanagra's data analysis components report their results in nicely formatted HTML.

TERADATA

Teradata is an enterprise software company that develops and sells a relational database management system (RDBMS) of the same name. In February 2011, Gartner ranked Teradata as one of the leading companies in data warehousing and enterprise analytics. Teradata was a division of the NCR Corporation, which acquired Teradata on February 28, 1991. Teradata's revenues in 2005 were almost $1.5 billion, with an operating margin of 21%. On January 8, 2007, NCR announced that it would spin off Teradata as an independently traded company; the spin-off was completed on October 1 of the same year, with Teradata trading under the NYSE stock symbol TDC. The Teradata product is referred to as a "data warehouse system" and stores and manages data. The data warehouses use a "shared nothing architecture," which means that each server node has its own memory and processing power. Adding more servers and nodes increases the amount of data that can be stored. The database software sits on top of the servers and spreads the workload among them. Teradata sells applications and software to process different types of data. In 2010, Teradata added text analytics to track unstructured data, such as word processor documents, and semi-structured data, such as spreadsheets. Teradata's product can be used for business analysis. Data warehouses can track company data, such as sales, customer preferences, and product placement. In 2010, the Ethisphere Institute named Teradata as one of the "World's Most Ethical Companies."
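The shared-nothing design described above for Teradata can be illustrated with a minimal sketch. It is not Teradata's actual distribution scheme, which the text does not describe: the Node and Cluster classes are hypothetical, and CRC32 hashing is an assumption standing in for whatever placement function a real system uses.

```python
import zlib


class Node:
    """One server node with its own private storage and its own compute."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.rows = []  # visible to this node only

    def local_sum(self, column):
        # Each node aggregates only the rows it owns.
        return sum(row[column] for row in self.rows)


class Cluster:
    """Spreads rows across nodes and combines per-node partial results."""

    def __init__(self, n_nodes):
        self.nodes = [Node(i) for i in range(n_nodes)]

    def node_for(self, key):
        # Assumed placement rule: hash the key, then take it modulo node count.
        digest = zlib.crc32(str(key).encode("utf-8"))
        return self.nodes[digest % len(self.nodes)]

    def insert(self, row, key_column):
        self.node_for(row[key_column]).rows.append(row)

    def total(self, column):
        # The workload is spread out: every node computes a partial sum.
        return sum(node.local_sum(column) for node in self.nodes)


if __name__ == "__main__":
    cluster = Cluster(n_nodes=4)
    for i in range(100):
        cluster.insert({"order_id": i, "amount": i}, key_column="order_id")
    print([len(node.rows) for node in cluster.nodes])
    print(cluster.total("amount"))
```

Because every row lives on exactly one node, adding nodes adds both storage and processing capacity, which matches the scaling behavior the section describes.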

WEKA

Written in Java, Weka (Waikato Environment for Knowledge Analysis) is a well-known suite of machine learning software that supports several typical data mining tasks, particularly data preprocessing, clustering, classification, regression, visualization, and feature selection. Its techniques are based on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes. Weka provides access to SQL databases through Java Database Connectivity (JDBC) and can process the result returned by a database query. Its main user interface is the Explorer, but the same functionality can be accessed from the command line or through the component-based Knowledge Flow interface.

XL MINER

XLMiner for Excel for Windows is a comprehensive data mining add-in for Excel, with neural nets, classification and regression trees, logistic regression, linear regression, Bayes classifier, k-nearest neighbors, discriminant analysis, association rules, clustering, principal components, and more. It is also a good tool for getting started with data mining, and it can be considered a business intelligence tool. XLMiner provides solutions that are statistical as well as machine learning oriented. Hence, there are numerous ways to approach a problem, and it is the task of the miner to determine which method is most appropriate to the problem at hand. XLMiner was developed by Resampling Stats, Inc., located in Arlington, Virginia, USA. In the summer of 2006 the company was merged into statistics.com, LLC. It makes and markets statistics-related software.