BizPro: Extracting and Categorizing Business Intelligence Factors from News

Similar documents
Data Mining - Evaluation of Classifiers

Data Mining Yelp Data - Predicting rating stars from review text

E-commerce Transaction Anomaly Classification

Feature Subset Selection in Spam Detection

Mining a Corpus of Job Ads

Analyze It use cases in telecom & healthcare

Hexaware E-book on Predictive Analytics

Social Media Mining. Data Mining Essentials

Making Sense of the Mayhem: Machine Learning and March Madness

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

Chapter 6. The stacking ensemble approach

Machine Learning in Spam Filtering

Text Opinion Mining to Analyze News for Stock Market Prediction

3/17/2009. Knowledge Management BIKM eclassifier Integrated BIKM Tools

Active Learning SVM for Blogs recommendation

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

Equity forecast: Predicting long term stock price movement using machine learning

How Can I Improve My App? Classifying User Reviews for Software Maintenance and Evolution

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis

AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM

Applying Machine Learning to Stock Market Trading Bryce Taylor

UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee

Pentaho Data Mining Last Modified on January 22, 2007

ElegantJ BI. White Paper. The Competitive Advantage of Business Intelligence (BI) Forecasting and Predictive Analysis

Predicting Flight Delays

A Logistic Regression Approach to Ad Click Prediction

The Data Mining Process

DATA MINING TECHNIQUES AND APPLICATIONS

CENG 734 Advanced Topics in Bioinformatics

A Novel Feature Selection Method Based on an Integrated Data Envelopment Analysis and Entropy Mode

Using Artificial Intelligence to Manage Big Data for Litigation

Cross-Validation. Synonyms Rotation estimation

Server Load Prediction

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Keywords social media, internet, data, sentiment analysis, opinion mining, business

APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev

Experiments in Web Page Classification for Semantic Web

Feature Engineering in Machine Learning

Role of Social Networking in Marketing using Data Mining

Machine Learning for Cyber Security Intelligence

MACHINE LEARNING BASICS WITH R

Neural Networks for Sentiment Detection in Financial Text

Information and Decision Sciences (IDS)

Decision Support System For A Customer Relationship Management Case Study

Data Mining Techniques for Prognosis in Pancreatic Cancer

AUTOMATED PROVISIONING OF CAMPAIGNS USING DATA MINING TECHNIQUES

Introduction to Data Mining

Azure Machine Learning, SQL Data Mining and R

An Introduction to Data Mining

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

Classification of Bad Accounts in Credit Card Industry

Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies

A Content based Spam Filtering Using Optical Back Propagation Technique

Machine Learning CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Office: LSK 5045 Begin subject: [ISOM3360]...

Predictive Data modeling for health care: Comparative performance study of different prediction models

The Artificial Prediction Market

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

COMPARING NEURAL NETWORK ALGORITHM PERFORMANCE USING SPSS AND NEUROSOLUTIONS

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Exploring the use of Big Data techniques for simulating Algorithmic Trading Strategies

OUTLIER ANALYSIS. Data Mining 1

Challenges of Cloud Scale Natural Language Processing

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

Course Description This course will change the way you think about data and its role in business.

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.

Comparison of machine learning methods for intelligent tutoring systems

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA

DATA MINING AND REPORTING IN HEALTHCARE

CYBER SCIENCE 2015 AN ANALYSIS OF NETWORK TRAFFIC CLASSIFICATION FOR BOTNET DETECTION

Predicting Student Performance by Using Data Mining Methods for Classification

A.I. in health informatics lecture 1 introduction & stuff kevin small & byron wallace

Car Insurance. Prvák, Tomi, Havri

Statistical Feature Selection Techniques for Arabic Text Categorization

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Automatic Text Processing: Cross-Lingual. Text Categorization

Blog Post Extraction Using Title Finding

Modeling Suspicious Detection Using Enhanced Feature Selection

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

How can we discover stocks that will

Data Mining + Business Intelligence. Integration, Design and Implementation

Machine Learning using MapReduce

Machine Learning Capacity and Performance Analysis and R

WHAT DEVELOPERS ARE TALKING ABOUT?

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing Classifier

Car Insurance. Jan Tomášek Štěpán Havránek Michal Pokorný

Introduction to Data Mining

ALGORITHMIC TRADING USING MACHINE LEARNING TECH-

Big Data Analytics for Healthcare

Using News Articles to Predict Stock Price Movements

Learning is a very general term denoting the way in which agents:

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Decision Support Systems

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

Forecasting stock markets with Twitter

Dissecting the Learning Behaviors in Hacker Forums

Transcription:

BizPro: Extracting and Categorizing Business Intelligence Factors from News Wingyan Chung, Ph.D. Institute for Simulation and Training wchung@ucf.edu

Definitions and Research Highlights BI Factor: qualitative evidence that influences market reactions on a company Represented as a textual sentence extracted from online news BizPro: An intelligent system that we developed to extract and categorize BI factors from textual news We examined the system s capability in profiling four IT companies BI factors. Statistical Classification Naïve Bayes Logistic Regression SVM OneR We discuss implications for BI analysis and data mining 2

Related Work Qualitative information was found useful in improving understanding of accounting information Various textual clues have been studied for market prediction and fraud detection Stock price prediction, market sentiment analysis, earning announcement analysis, news content analysis Financial fraud detection, bankruptcy prediction, deception detection Text categorization and feature selection Data mining (NB, LR, SVM, NN, etc.), computational linguistics (NMF, LDA, etc.) Feature combination and heuristics for feature selection Scarce work is found on analyzing company BI factors; the approaches may not consider different BI classes in textual documents 3

Proposed System: BizPro BizPro: an intelligent system that automatically extracts and categorizes company BI factors from news articles Input: BI category profile; training documents pre-categorized into BI categories; testing documents (document = key sentence of company news) Output: Categorization of testing documents into BI categories BizPro incorporates domain knowledge of BI analysis, BI factor identification heuristics, and BI categories developed based on literature review and expert guidance. 4

System Components of BizPro Input News Articles Expert Guidance Stop-Word List Human Raters Document Indexer Extracting keywords Removing stop words Sentence extraction Document indexing BI Factor Extractor Modeling of BI factors Term importance computation Risk term selection BI Factor Categorizer Automatic BI profiling One-R, Naïve Bayes Logistic regression, SVM BI factor evaluation BI Category Operations Mgmt. Strategic Management Econ. / Environmt. Ftr. Technological Change Legal Issue BI Profile BI Profile BI Profile BI Profile Category-Specific BI Terms 5

Research Questions How can an intelligent system be developed to model and extract different classes of BI factors from company news articles? What is the performance of our proposed system, BizPro, in comparison with a benchmark algorithm in BI factor categorization? When used in categorizing company-specific BI factors, what are the performances of the automatic categorization techniques used in BizPro? How does BizPro s use of textual features extracted from companyspecific news articles contribute to predicting risk categories? 6

BI Categories and Expert Verification Developed based on literature research, industry standards, and government sources 1. Operations management 2. Economic / environment factor 3. Strategic management 4. Technological change 5. Legal issue Expert verification by a senior VP (with a PhD in business and 7+ yrs experience) in risk mgmt of a brokerage firm Examined the BI categorization and provided comments that were used in developing systems requirements and functionality 7

Case Study: Profiling BI Factors of IT Companies from News Dataset 6859 sentences extracted from 231 news articles about 4 technology companies 8

Sample News Headlines 9

Top Keywords 10

Text Classifiers Naïve Bayes Finding the best class for a given input vector x by maximizing the product of all feature s probabilities of belonging to the class Logistic Regression Predict y = 1 if Predict y = 0 if ŷ 0.5 ŷ<0.5 w Support Vector Machine > x i Finding best weights to form a margin to classify data points One-R (benchmark) Choose the variable with the smallest predictive error to form one rule for classification 1 0.5 ŷ ŷ = 1 1+e w> x i min w 2 + C w2r d, i 2R + subject to ŷ ) ln = w > x i 1 ŷ NX y i (w > x i + b) 1 i for i =1...N i i 11

Performance Metrics We used ten-fold cross validation to evaluate the techniques performances We tested performance of classifying the first 5 statements in an article by the first m top-ranked terms, where m increases from 50 to 790 with 20 as each increment 12

Experimental Results Precision Recall F-measure AUC 13

Comparison Naïve Bayes demonstrates the highest classification performance in terms of precision, recall, and F-measure. NB outperformed SVM and LR in P/R/F, while LR outperformed SVM in P/R/F. LR s performance fluctuates highly with the number of features. LR may produce higher variance than Naïve Bayes when training size is limited, making it less stable in performance. SVM s performance is adversely affected by the increasing number of features. Increasing SVM cost parameter values dramatically increases its performance 14

Discussion and Implication The encouraging results indicate a high potential to use the proposed framework to supplement traditional risk assessment methods. Managers can use the techniques to identify and categorize BI factors from news articles. Provides strong implication for intelligence agencies and organizations that need to handle risk factors from online textual data. 15

Conclusion This research addressed the need for BI profiling and categorization from news articles An intelligent system call BizPro was developed to profile and categorize BI factors from online news articles A study of four technology companies show that BizPro outperformed benchmark technique in P, R, F, and AUC. Feature selection, BI factor classification, and document indexing New application of NB, LR algorithms and their performance New insights on SVM algorithm tuning 16

Research Opportunities Cyber security and privacy Online risk assessment Prediction of threat Modeling human behavior in online transaction Cyber Intelligence Lab at IST http://cyber.ist.ucf.edu/ Text classification and clustering Feature selection Simulation of data and activities Development of classification techniques Funding available: NSF, DoD, DHS, DARPA,, etc. PhD in Modeling and Simulation 17

http://cyber.ist.ucf.edu/ Thank you! Wingyan Chung, Ph.D. Institute for Simulation and Training Email: wchung@ucf.edu; Tel.: (407) 882-1435