Text Analytics using High Performance SAS & SAS Enterprise Miner
|
|
|
- Gregory Mills
- 10 years ago
- Views:
Transcription
1 Text Analytics using High Performance SAS & SAS Enterprise Miner Edward R. Jones, Ph.D. Texas A&M Statistical Services, LP Abstract: The latest release of SAS Enterprise Miner, version 13.1, contains high performance modules, including new modules for text mining. This paper compares the new High Performance Text Mining modules to those found in SAS Text Miner. The advantages and disadvantages of HP Text Miner are discussed. This is illustrated using data from the National Highway and Transportation Safety Administration for the GMC recall of 258,000 SUVs for a potential fire hazard. INTRODUCTION In August 2014, General Motors announced the third recall to correct a life- threatening hazard. The new recall was for six models: Cobalt s, Pontiac Solstices, Pontiac G5s, Saturn Sky s, Chevrolet HHRs and Saturn Ions. GM claimed 13 deaths were related to faulty ignition switches after reports of 28 fires in the power windows and locks of these vehicles from The NHTSA, National Highway Traffic Safety Administration, maintains a public database of consumer complaints for these and other vehicles. The question is whether these complaints could have been used to identify this hazard earlier than Over 1 million complaints have been registered and are available for download from the NTHSA. To examine this question, both SAS Text Miner and SAS HP Text Miner, version 13.1, were used to analyze over 11 thousand complaints recorded by the NTHSA for this latest recall. These data include not only written complaints but also information on the vehicle, the driver and whether or not the vehicle was involved in a fire, a crash, and the number of injuries and deaths. This paper describes how SAS can be used to analyze these data. The analysis is done using both the original text miner procedures and also the new SAS HP Text Miner. The purpose is to compare and contrast these procedures, and to illustrate the advantages of High Performance Text Miner. This application illustrates the utility of incorporating unstructured data into a traditional analysis of structure data. SAS Enterprise Miner combined with the new SAS HP Text Miner allows analysts to not only discover what customers are complaining about but why. SAS TEXT MINER SAS Text Miner is incorporated within SAS Enterprise Miner and shares the same syntax. However, now SAS analysts have two options for text mining, each with a different character: Page 1
2 SAS Text Miner and SAS High Performance (HP) Text Miner. The first option has been available for some time, but the second only became available with Versions 9.4 of SAS and 13.1 of SAS Enterprise Miner. SAS Text Miner consists of seven nodes, but in most applications, five are used in the following order. 1. Text Import reads text files stored in a single directory 2. Text Parsing creates the Term/Document matrix 3. Text Filter allows analysts to eliminate non- useful words 4. Text Topic identifies document groups associated with topic words 5. Text Cluster identifies document clusters from their word structures. Figure 1. The Text Mining Process using SAS Text Miner The text import node requires all documents to be placed in individual files. Unfortunately, the NTHSA data file contained all 11,363 complaints in one file. These had to be extracted and placed into 11,363 files. The process of extracting individual comments and placing them into separate files was accomplished using a small Java file described at the end of this paper. The Java program reads a tab- delimited file containing the comment identifier and the comment. Each comment is written to a separate text file with a unique name. The name incorporates the comment ID, which makes it easy to associate individual comments with cluster and topic groupings from the text mining process. The analytics process of for text mining these complaints is described in Figure 1 above. The default document size in text import is 100 characters. This had to be increased to 2100 characters. Text parsing was done using the default settings. Text filtering was done using inverse document frequency term weights, and the filter viewer was used to review terms to identify those that were not useful in this analysis. This consisted mainly of the initials of the NTHSA recorder. Page 2
3 Using the default settings, the text cluster node identified 10 clusters and 37 SVD vectors. The text topic node was used iteratively to identify 9 user word clusters including one word cluster for 1,416 complaints about a key and ignition problem, 1,500 complaints about a brake problem, and 7 others for various mechanical problems such as air bag and steering problems. Table 1. Topic Groups Identified by Topic Node Topic Group Complaints Description Air Bag Problem Brake Problem Door Problem Electrical Fuel System Key & Ignition Recall Start Problem Steering Problem The output from text mining process consisted of 9 binary attributes associated with the 9 user developed topics and 3 interval attributes consisting of the estimated probabilities for the three text clusters. These were merged into the original structured data to produce a single file for modeling whether fires in these vehicles might be related to some of these complaint topic groups or text clusters. After the merge, the roles of all attributes were modified as either rejected, input or target. The text topic groups and text cluster estimated probabilities were all marked as inputs. The SEMMA process in Enterprise Miner was followed. After preparing this integrated file, outliers were set to missing using the replacement node, and missing values were imputed. Page 3
4 The model step was implemented using the logistic regression node, the decision tree node and neural network node. Experience with problems involving many input attributes has found it s prudent to reduce the number of input attributes using variable selection. The logistic regression and decision tree models all had the same misclassification error rate for predicting whether the vehicle was involved in a crash based upon the topic of the complaint and the driver and vehicle characteristics. The logistic regression model developed using stepwise selection had the smallest Validation Average Squared Error (VASE). It incorporated only one of the input attributes: text topic 2, which was the topic related to a door switch problem. The GMC recall reported that the fires were related to shorts in the door switch. The complaints to point to problems related to the key & ignition problem. We now understand that this problem caused steering and braking problems related to vehicle crashes. Combining those three categories suggests the NHTSA received 4710 complaints related to the key and ignition problem. That s over 40% of all complaints for these vehicles. SAS HIGH PERFORMANCE TEXT MINER SAS High Performance Text Miner is also incorporated within SAS Enterprise Miner since version The latest version, 13.1 is an improvement and expansion over There are several features that differentiate High Performance Text Miner from the original version of SAS Text Miner. First, all High Performance nodes are designed to run faster by using high- speed algorithms and multi- core and cluster capabilities. Second, the SVD process in the HP Text Miner not incorporates a rotation of the principal vectors from the SVD, which helps with the interpretation of topics and document groupings. This sets HP Text Miner above its competitors. Another important difference in text mining is that unlike SAS Text Miner, High Performance Text Miner does not require all documents to be stored in individual files inside a single directory. Instead, High Performance Text Miner will accept any SAS file that contains two attributes, one with the role key and a second Page 4
5 with the role text. The key field should be a unique interval value associated with each of the text fields. This type of input is ideal for the NTHSA data, surveys and other data containing written comments. In SAS 9.4, these features are implemented using Proc HPTMINE and Proc HPTMSCORE. These were described in the SAS references at the end of this paper. In this example, the key field was the unique NTHSA ID assigned to each complaint and the text field was the written comment. In these data it was restricted to a maximum of 2100 characters, but High Performance Text Miner can accept written comments up to 32,000 characters. The Sample- Explore- Modify process in SEMMA using High Performance Text Miner is much simpler than what was described for SAS Text Miner. First, there is no need to separate the text mining from the original data. In addition, at this time there is no equivalent to the Text Filter and Text Topic nodes. Instead, High Performance Text Miner by default exports topic groups defined by the SVD analysis. These will be orthogonal to one another. Using the default setting for the High Performance Text Miner node produced 32 topic groups. These were marked automatically as interval inputs. Version 13.1 of High Performance Enterprise Miner also contains a new Impute node that was used to estimate missing values, such as vehicle mileage. Page 5
6 Table 2. Topic Groups Identified by HP Text Miner Topic Group Complaints Description Deploy, bag, air bag, air, airbag Police, Report, Police Report, file, sustain Accident, cause, sustain, result, avoid Airbag, passenger, seat, side, seatbelt Hit, cause, curb, curb, side, in front of Driver, side, side, passenger, roll Break, break, pedal, apply, stop mph, vehicle, approximately, fail, wheel Crash, pole, curb, tree, cause Notice that the topic attributes are named COL1 COL20, but High Performance Text Miner puts the names of the words associated with each attribute in its label. Notice that COL4 is associated with a comment about a vehicle fire. COL1 and COL4 are talking airbag deployment. None have an obvious relationship to keys or ignition. The modeling approach is almost idential to that used with SAS Text Miner. Stepwise logistic regression is used, along with decision tree and neural network. However there is one additional modeling node that was used this is only available in the High Performance nodes: High Performance Random Forests. Page 6
7 In this case, the model that produced the smallest misclassification error is the Neural Network Model. It was 2.8%, lower than that seen with the non- High Performance nodes. The new High Performance Neural Network provides clues about the relative importance of attributes in the model. First the network is displayed with colorful lines. The darker the line, the larger the weight for that attribute. The new HP Neural Network Node also produces a visual display of the network. In this case, the inputs to the network were pre- screened using the HP Variable Selection Node. In general a neural network tends to converge faster if the variables are pre- screened. Page 7
8 In this case the attributes with the highest weights were text topics. Another plot that describes the relative size of the network weights is the weights plot. From this it s apparent the the topics with the larger weights are COL2, COL11, COL24, COL21 and COL20. These correspond to the following toics: COL2 Police Report COL11 damage COL24 Impact, collision COL21 File Claim COL20 Insurance Company These make sense as indicators of an accident (crash). However, in this analysis, the key and ignition are not one of selected topics. Page 8
9 This is also seen in the importance statistics from the new HP Random Forest node. Page 9
10 The HP Tree Node produces a more compact solution. SUMMARY Both SAS Text Miner and High Performance Text Miner were able to uncover 4,710 complaints (41%) that were associated with the key and ignition problem in selected models. The latest version of High Performance Text Miner, version 13.1, is easier to use since it processes a single file containing multiple comments. In addition, the modeling output from High Performance Neural Network models and Random Forest Models makes it easier to identify key complaint topics associated with the target. CONTACT INFORMATION Comments and questions are valued and encouraged. Please feel free to the author: Ed Jones ejones at- stat dot tamu dot edu Texas A&M University, Dept. of Statistics: Texas A&M Statistical Services: Page 10
11 REFERENCES Text Mining & Big Data Azevedo, A. and Santos M.F. (2008) KDD, SEMMA and CRISP- DM: A parallel overview. In the proceedings of the IADIS European Conference on Data Mining 2008, pp Available from After joining, search on the title for a pdf copy. Berry, M.W. and Kogan, B. (2010) Text Mining: Applications & Theory; John Wiley & Sons. Chapman, P., Clinton, J., et.al. (2000) CRISP- DM 1.0: Step- by- step data mining guide. Available from IBM at: ftp://ftp.software.ibm.com/software/analytics/spss/support/modeler/documentation/14 /UserManual/CRISP- DM.pdf Fayyad, U.M. et.al. (1996) From data mining to knowledge discovery: an overview. In Fayyad, U. M et.al. Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press. Ingersoll, G.S., Morton, T.S. and Farris, A.L. (2013) Taming Text; Manning Publications. Liu, B. (2011) Web Data Mining, 2 nd Edition; Springer. Russell M. A. (2011) Mining the Social Web; O Reilly. Sathi, A. (2012) Big Data Analytics: Disruptive Technologies for Changing the Game; MC Press Online. Weiss, S.M., Indurkhya, N. and Zhang, T. (2005) Text Mining: Predictive Methods for Analyzing Unstructured Information; Springer. Weiss, S.M., Indurkhya, N. and Zhang, T. (2010) Fundamentals of Predictive Text Mining; Springer. Linguistics Bird, S., Klein, E. and Loper, E. (2009) Natural Language Processing with Python; O Reilly. Cambria, E. and Hussain, A. (2012) Sentic Computing: Techniques, Tools, and Applications; Springer. Indurkhya, N. and Damerau, R. J. (2010) Handbook of Natural Language Processing; CRC Press. Page 11
12 Liu, B. (2012) Sentiment Analysis and Opinion Mining; Morgan & Claypool Publishers. Manning C.D. and Schutze, H. (1999) Foundations of Statistical Natural Language Processing, MIT Press. Pustejovsky, J. and Stubbs, A. (2012) Natural Language Annotation for Machine Learning; O Reilly. Sogaard A. (2013) Semi- Supervised Learning and Domain Adaption in Natural Language Processing; Morgan & Claypool Publishers. Wilcock, G. (2009) Introduction to Linguistic Annotation and Text Analytics Computer Science Lin, J. and Dyer, C. (2010) Data- Intensive Text Processing with MapReduce; Morgan & Claypool Publishers. Holmes, A. (2012) Hadoop in Practice; Manning Publishers. Lam, C. (2011) Hadoop in Action; Manning Publishers. SAS SAS Technical Paper Enabling High- Performance Text Mining Procedures, High- Performance- Text- Mining- Procedures.pdf Zhao, Z., Albright, R., Cox, J. and Bieringer, A. (2013) Big Data Meets Text Mining, SAS Global Forum 2013, Statistics Bartlett, Randy (2013) A practitioner s Guide to Business Analytics, McGraw- Hill. Japkowicz, N. and Shah, M. (2011) Evaluating Learning Algorithms: A Classical Perspective; Cambridge Press. Text Analytics Software Page 12
13 Chakraborty, G., Pagolu, M., and Garla, S. (2013) Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS; SAS, Inc. Cunningham, H. et.al. (2012) Developing Language Processing Components with GATE Version 7 (a User Guide); online Elder, J., Miner, G. and Nisbet, R. (2011) Practical Text Mining and Statistical Analysis for Non- Structured Text Data Applications Academic Press. Nisbet, R., Elder, J. and Miner, G. (2009) Statistical Analysis and Data Mining Applications; Academic Press. Voice Recognition Rabiner, L and Jung, F (1993) Fundamentals of Speech Recognition, Prentice Hall. Jelinek, F. (1998) Statistical Methods for Speech Recognition, MIT Press. Page 13
14 ACRONYMS & SOFTWARE TOOLS AAP... Advanced Analytics Platform BA... Business Analytics BI... Business Intelligence CRISP- DM... Cross Industry Standard Process for Data Mining DMP... Data Management Platform DSP... Demand Side Platform (also Digital Signal Processor) DW... Data Warehouse ETL... Extract Load Transform EM... (SAS) Enterprise Miner GATE... General Architecture for Text Engineering (software) Hadoop... Java Class for HDFS (distributed system for text files) HP... High Performance HDFS... Hadoop Distributed File System HMM... Hidden Markov Model (used in POS tagging) IBM Modeler... IBM s software for data & text mining, also called IBM SPSS Modeler IE... Information Extraction IR... Information Retrieval KPI... Key Performance Indicator LDC... Linguistic Data Consortium ( MapReduce... Java Class for Hadoop designed to perform analytics on HDFS files MPP... Massively Parallel Platform NER... Named Entity Resolution NLP... Natural Language Processing NLTK... Natural Language Toolkit (Python software kit) PCA... Principal Components Analysis PII... Personally Identifiable Information POS... Parts- of- Speech. (Sometimes used for point of sale.) RapidMiner... R- based package for data mining (software) ROC... Receiver Operating Characteristic curve. SAS... Statistical Analysis System (software) SAS TM... SAS Text Miner (software) SEMMA... Sample, Explore, Modify, Model and Assess (SAS Acronym) SPSS... Statistical Package for the Social Sciences (software) SSP... Supply Side Platform STATISTICA... Statistical Analysis Package (software) SVD... Singular Value Decomposition - algorithm SVM... Support Vector Machine algorithm for multi- classification problems UIMA... Unstructured Information Management Architecture Page 14
15 JAVA PROGRAM FOR POPULATING A SELECTED DIRECTORY WITH TEXT FILES package fileprocess; import java.io.*; import java.util.*; public class FileProcess { /** args the command line arguments */ public static void main(string[] args) { // TODO code application logic here System.out.println("Text Mining Comments Start -->"); System.out.println("Opening Input File -->"); /* Set path to directory containing the text file, and set the second argument to the name of that file */ File infile = new File("/Users/Home/Desktop/Java/","gmc_ignrecalls.txt"); if (infile.exists()) { System.out.println("\tFound comments file "+infile); else { System.out.println("\tDid not find comments file "+infile+"\n"); System.exit(0); /* Set path to directory used to store the individual file */ File outdir = new File("/Users/Home/Desktop/Java/gmc_ignrecalls/"); if (outdir.exists()) { System.out.println("\tFound output directory "+outdir+"\n"); else { System.out.println("\tDid not find output directory"+outdir+"\n"); System.exit(0); try { FileReader incomments = new FileReader(inFile); FileWriter outcomments = null; BufferedReader breader = new BufferedReader(inComments); BufferedWriter bwriter = null; String str = null; String surveyid = null; String surveydate = null; String comment = null; String filename = null; StringTokenizer strfields = null; int nline = 1; int nocomment = 0; int ncomment = 0; int ncommentplus = 0; System.out.println("Reading Input File 1st Line-->"); str = breader.readline(); System.out.println("\tLine "+nline+" "+str+"\n"); Page 15
16 nline++; System.out.println("Reading Comments -->"); while ((str = breader.readline())!= null) { strfields = new StringTokenizer(str, "\t\n"); int nfields = strfields.counttokens(); if (nfields > 0) surveyid = strfields.nexttoken(); /* if (nfields > 1) surveydate = strfields.nexttoken(); */ if (nfields > 1) { str = strfields.nexttoken(); comment = str.trim(); if (comment.length() > 0) { ncomment++; /* set first string to the path of the directory you are using to store the individual text files */ filename = "/Users/Jones/Desktop/Java/gmc_ignrecalls/Survey"+surveyID+".txt"; outcomments = new FileWriter(fileName); bwriter = new BufferedWriter(outComments); outcomments.write(comment); outcomments.close(); if (nfields > 3) { System.out.println("ERROR - Extra Token in Line: "+nline+" "+str); ncommentplus++; nline++; incomments.close(); System.out.println("Closed Input File "); nocomment = nline-2-ncomment-ncommentplus; System.out.println("Total Comment Files -->"+(nline-2)); System.out.println("Files without a comment -->"+nocomment); System.out.println("Files with one or more comments-->"+ncomment); System.out.println("Files with two or more comments-->"+ncommentplus); catch (IOException e) { System.out.println(e); System.exit(1); Page 16
Text Analytics using High Performance SAS Text Miner
Text Analytics using High Performance SAS Text Miner Edward R. Jones, Ph.D. Exec. Vice Pres.; Texas A&M Statistical Services Abstract: The latest release of SAS Enterprise Miner, version 13.1, contains
CAPTURING THE VALUE OF UNSTRUCTURED DATA: INTRODUCTION TO TEXT MINING
CAPTURING THE VALUE OF UNSTRUCTURED DATA: INTRODUCTION TO TEXT MINING Mary-Elizabeth ( M-E ) Eddlestone Principal Systems Engineer, Analytics SAS Customer Loyalty, SAS Institute, Inc. Is there valuable
COSC 6397 Big Data Analytics. Mahout and 3 rd homework assignment. Edgar Gabriel Spring 2014. Mahout
COSC 6397 Big Data Analytics Mahout and 3 rd homework assignment Edgar Gabriel Spring 2014 Mahout Scalable machine learning library Built with MapReduce and Hadoop in mind Written in Java Focusing on three
Paper 3508-2015. Downtime of a truck = Truck repair end date - Truck repair start date
Paper 3508-2015 Using Text from Repair Tickets of a Truck Manufacturing Company to Predict Factors that Contribute to Truck Downtime Ayush Priyadarshi and Dr. Goutam Chakraborty, Oklahoma State University
Data Mining Solutions for the Business Environment
Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania [email protected] Over
Data Mining. SPSS Clementine 12.0. 1. Clementine Overview. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine
Data Mining SPSS 12.0 1. Overview Spring 2010 Instructor: Dr. Masoud Yaghini Introduction Types of Models Interface Projects References Outline Introduction Introduction Three of the common data mining
Introduction to Data Mining
Introduction to Data Mining José Hernández ndez-orallo Dpto.. de Systems Informáticos y Computación Universidad Politécnica de Valencia, Spain [email protected] Horsens, Denmark, 26th September 2005
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI
Data Mining Knowledge Discovery, Data Warehousing and Machine Learning Final remarks Lecturer: JERZY STEFANOWSKI Email: [email protected] Data Mining a step in A KDD Process Data mining:
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING ABSTRACT In most CRM (Customer Relationship Management) systems, information on
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
ANALYTICS CENTER LEARNING PROGRAM
Overview of Curriculum ANALYTICS CENTER LEARNING PROGRAM The following courses are offered by Analytics Center as part of its learning program: Course Duration Prerequisites 1- Math and Theory 101 - Fundamentals
ASK THE CAR WHAT HAPPENED
ASK THE CAR WHAT HAPPENED Reconstructing traffic accidents through the use of a vehicle s black box technology. Christie Swiss Attorney at Law Collins Collins Muir + Stewart LLP Oakland South Pasadena
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that
C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER
INTRODUCTION TO SAS TEXT MINER TODAY S AGENDA INTRODUCTION TO SAS TEXT MINER Define data mining Overview of SAS Enterprise Miner Describe text analytics and define text data mining Text Mining Process
WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat
Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise
DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.
DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,
New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction
Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
Mammoth Scale Machine Learning!
Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes
ANALYTICS IN BIG DATA ERA
ANALYTICS IN BIG DATA ERA ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY, DISCOVER RELATIONSHIPS AND CLASSIFY HUGE AMOUNT OF DATA MAURIZIO SALUSTI SAS Copyr i g ht 2012, SAS Ins titut
How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
Leveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,
Oracle Data Miner (Extension of SQL Developer 4.0)
An Oracle White Paper October 2013 Oracle Data Miner (Extension of SQL Developer 4.0) Generate a PL/SQL script for workflow deployment Denny Wong Oracle Data Mining Technologies 10 Van de Graff Drive Burlington,
Does the Federal government require them? No, the Federal government does not require manufacturers to install EDRs.
EDR Q&As THE BASICS What is an EDR? What is its purpose? An Event Data Recorder (EDR) is a function or device installed in a motor vehicle to record technical vehicle and occupant information for a brief
Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features
Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features Charlie Berger, MS Eng, MBA Sr. Director Product Management, Data Mining and Advanced Analytics [email protected] www.twitter.com/charliedatamine
Data Mining: Overview. What is Data Mining?
Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,
Chapter 12 Discovering New Knowledge Data Mining
Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to
This material is built based on, Patterns covered in this class FILTERING PATTERNS. Filtering pattern
2/23/15 CS480 A2 Introduction to Big Data - Spring 2015 1 2/23/15 CS480 A2 Introduction to Big Data - Spring 2015 2 PART 0. INTRODUCTION TO BIG DATA PART 1. MAPREDUCE AND THE NEW SOFTWARE STACK 1. DISTRIBUTED
How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK
How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK Agenda Analytics why now? The process around data and text mining Case Studies The Value of Information
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
Machine Learning with MATLAB David Willingham Application Engineer
Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the
Advanced In-Database Analytics
Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??
The Prophecy-Prototype of Prediction modeling tool
The Prophecy-Prototype of Prediction modeling tool Ms. Ashwini Dalvi 1, Ms. Dhvni K.Shah 2, Ms. Rujul B.Desai 3, Ms. Shraddha M.Vora 4, Mr. Vaibhav G.Tailor 5 Department of Information Technology, Mumbai
Integrating NLTK with the Hadoop Map Reduce Framework 433-460 Human Language Technology Project
Integrating NLTK with the Hadoop Map Reduce Framework 433-460 Human Language Technology Project Paul Bone [email protected] June 2008 Contents 1 Introduction 1 2 Method 2 2.1 Hadoop and Python.........................
A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS
A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS Mrs. Jyoti Nawade 1, Dr. Balaji D 2, Mr. Pravin Nawade 3 1 Lecturer, JSPM S Bhivrabai Sawant Polytechnic, Pune (India) 2 Assistant
Maschinelles Lernen mit MATLAB
Maschinelles Lernen mit MATLAB Jérémy Huard Applikationsingenieur The MathWorks GmbH 2015 The MathWorks, Inc. 1 Machine Learning is Everywhere Image Recognition Speech Recognition Stock Prediction Medical
Integrating VoltDB with Hadoop
The NewSQL database you ll never outgrow Integrating with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. is an database for handling high velocity data.
Course Syllabus. Purposes of Course:
Course Syllabus Eco 5385.701 Predictive Analytics for Economists Summer 2014 TTh 6:00 8:50 pm and Sat. 12:00 2:50 pm First Day of Class: Tuesday, June 3 Last Day of Class: Tuesday, July 1 251 Maguire Building
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies Somesh S Chavadi 1, Dr. Asha T 2 1 PG Student, 2 Professor, Department of Computer Science and Engineering,
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
New York Study of Booster Seat Effects on Injury Reduction Compared to Safety Belts in Children Aged 4-8 in Motor Vehicle Crashes
New York Study of Booster Seat Effects on Injury Reduction Compared to Safety Belts in Children Aged 4-8 in Motor Vehicle Crashes Kainan Sun, Ph.D., Michael Bauer, M.S. Sarah M. Sperry, M.S., Susan Hardman
Database Marketing, Business Intelligence and Knowledge Discovery
Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski
Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC
Technical Paper (Last Revised On: May 6, 2013) Big Data Analytics Benchmarking SAS, R, and Mahout Allison J. Ames, Ralph Abbey, Wayne Thompson SAS Institute Inc., Cary, NC Accurate and Simple Analysis
IBM SPSS Modeler 15 In-Database Mining Guide
IBM SPSS Modeler 15 In-Database Mining Guide Note: Before using this information and the product it supports, read the general information under Notices on p. 217. This edition applies to IBM SPSS Modeler
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
Hexaware E-book on Predictive Analytics
Hexaware E-book on Predictive Analytics Business Intelligence & Analytics Actionable Intelligence Enabled Published on : Feb 7, 2012 Hexaware E-book on Predictive Analytics What is Data mining? Data mining,
2015 Workshops for Professors
SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market
Data Mining for Fun and Profit
Data Mining for Fun and Profit Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. - Ian H. Witten, Data Mining: Practical Machine Learning Tools
W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015
W. Heath Rushing Adsurgo LLC Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare Session H-1 JTCC: October 23, 2015 Outline Demonstration: Recent article on cnn.com Introduction
Maximizing Return and Minimizing Cost with the Decision Management Systems
KDD 2012: Beijing 18 th ACM SIGKDD Conference on Knowledge Discovery and Data Mining Rich Holada, Vice President, IBM SPSS Predictive Analytics Maximizing Return and Minimizing Cost with the Decision Management
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Why is Internal Audit so Hard?
Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets
Pentaho Data Mining Last Modified on January 22, 2007
Pentaho Data Mining Copyright 2007 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at www.pentaho.org
Title. Introduction to Data Mining. Dr Arulsivanathan Naidoo Statistics South Africa. OECD Conference Cape Town 8-10 December 2010.
Title Introduction to Data Mining Dr Arulsivanathan Naidoo Statistics South Africa OECD Conference Cape Town 8-10 December 2010 1 Outline Introduction Statistics vs Knowledge Discovery Predictive Modeling
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
How To Solve The Kd Cup 2010 Challenge
A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China [email protected] [email protected]
Index Contents Page No. Introduction . Data Mining & Knowledge Discovery
Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
Big Data Can Drive the Business and IT to Evolve and Adapt
Big Data Can Drive the Business and IT to Evolve and Adapt Ralph Kimball Associates 2013 Ralph Kimball Brussels 2013 Big Data Itself is Being Monetized Executives see the short path from data insights
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.
International Journal of Emerging Research in Management &Technology Research Article October 2015 Comparative Study of Various Decision Tree Classification Algorithm Using WEKA Purva Sewaiwar, Kamal Kant
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing
Zihang Yin Introduction R is commonly used as an open share statistical software platform that enables analysts to do complex statistical analysis with limited computing knowledge. Frequently these analytical
Automotive Black Box Data Recovery Systems
Introduction Automotive Black Box Data Recovery Systems By Don Gilman For years, airplane crash investigators have had the benefit of retrieving data from the flight-data recorder, or "black box." This
Predictive Analytics Certificate Program
Information Technologies Programs Predictive Analytics Certificate Program Accelerate Your Career Offered in partnership with: University of California, Irvine Extension s professional certificate and
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10
1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom
Advanced analytics at your hands
2.3 Advanced analytics at your hands Neural Designer is the most powerful predictive analytics software. It uses innovative neural networks techniques to provide data scientists with results in a way previously
www.virtualians.pk CS506 Web Design and Development Solved Online Quiz No. 01 www.virtualians.pk
CS506 Web Design and Development Solved Online Quiz No. 01 Which of the following is a general purpose container? JFrame Dialog JPanel JApplet Which of the following package needs to be import while handling
A Property & Casualty Insurance Predictive Modeling Process in SAS
Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing
Our Raison d'être. Identify major choice decision points. Leverage Analytical Tools and Techniques to solve problems hindering these decision points
Analytic 360 Our Raison d'être Identify major choice decision points Leverage Analytical Tools and Techniques to solve problems hindering these decision points Empowerment through Intelligence Our Suite
Big Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
Hadoop s Advantages for! Machine! Learning and. Predictive! Analytics. Webinar will begin shortly. Presented by Hortonworks & Zementis
Webinar will begin shortly Hadoop s Advantages for Machine Learning and Predictive Analytics Presented by Hortonworks & Zementis September 10, 2014 Copyright 2014 Zementis, Inc. All rights reserved. 2
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING
Wrocław University of Technology Internet Engineering Henryk Maciejewski APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING PRACTICAL GUIDE Wrocław (2011) 1 Copyright by Wrocław University of Technology
STATISTICA. Financial Institutions. Case Study: Credit Scoring. and
Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT
In this presentation, you will be introduced to data mining and the relationship with meaningful use.
In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine
Data Warehousing and Data Mining in Business Applications
133 Data Warehousing and Data Mining in Business Applications Eesha Goel CSE Deptt. GZS-PTU Campus, Bathinda. Abstract Information technology is now required in all aspect of our lives that helps in business
Tax Fraud in Increasing
Preventing Fraud with Through Analytics Satya Bhamidipati Data Scientist Business Analytics Product Group Copyright 2014 Oracle and/or its affiliates. All rights reserved. 2 Tax Fraud in Increasing 27%
CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott
Sunnie Chung. Cleveland State University
Sunnie Chung Cleveland State University Data Scientist Big Data Processing Data Mining 2 INTERSECT of Computer Scientists and Statisticians with Knowledge of Data Mining AND Big data Processing Skills:
Automated Content Analysis of Discussion Transcripts
Automated Content Analysis of Discussion Transcripts Vitomir Kovanović [email protected] Dragan Gašević [email protected] School of Informatics, University of Edinburgh Edinburgh, United Kingdom [email protected]
Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis
Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Yue Dai, Ernest Arendarenko, Tuomo Kakkonen, Ding Liao School of Computing University of Eastern Finland {yvedai,
The 4 Pillars of Technosoft s Big Data Practice
beyond possible Big Use End-user applications Big Analytics Visualisation tools Big Analytical tools Big management systems The 4 Pillars of Technosoft s Big Practice Overview Businesses have long managed
High Productivity Data Processing Analytics Methods with Applications
High Productivity Data Processing Analytics Methods with Applications Dr. Ing. Morris Riedel et al. Adjunct Associate Professor School of Engineering and Natural Sciences, University of Iceland Research
CSE 308. Coding Conventions. Reference
CSE 308 Coding Conventions Reference Java Coding Conventions googlestyleguide.googlecode.com/svn/trunk/javaguide.html Java Naming Conventions www.ibm.com/developerworks/library/ws-tipnamingconv.html 2
