WHITEPAPER. Text Analytics Beginner's Guide

What is Text Analytics?

"Text Analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation." (Wikipedia)

or

"Given a collection of text, text analytics tells you who, where, when, what, and how so that you can figure out why." (Lexalytics)

Text Analytics (or "text mining" to some), in a nutshell, is about teaching a computer to read. No, it's not Optical Character Recognition (OCR); it's about understanding what a piece of text is trying to tell you, the meaning of the words, and then deciphering how that meaning affects your business.

In your journey through Text Analytics, you're going to run into a few concepts over and over again, and the purpose of this (deliberately short) white paper is to give you some background on what those things actually mean, including:

Named Entity Recognition: Who, where, when is being discussed in the content?
Classification/Tags: What are the important concepts that are being used?
Theme Extraction: What are the important words that are being used?
Summarization: What is the short version of the most important text?
Sentiment Analysis: How is the conversation occurring? Is the conversation positive or negative for Entities, Themes, and the Classifiers/Tags?

Named Entity Recognition: Who, where, when

Named Entity Recognition (sometimes called "Named Entity Extraction") is the ability to extract and assign meaning to items like the names of individuals, companies, products and places. These are generally proper nouns and constitute the "who" and "where." However, there are lots of other sorts of Named Entities that can be interesting, like addresses, dates, phone numbers and website URLs. Being able to automatically extract these can be important depending on what you're trying to accomplish.
Depending on the system, you can tell it what you're looking for ahead of time, or the system can have some sort of statistical model that it uses to find things that "look like" a company, for instance. While a company name and a person name are both proper nouns, they are referred to differently in text, and machines can make pretty good guesses as to which type a certain name is, whether it is a place (Hilton Head), a person (Paris Hilton), or a company (Hilton Hotels).

The guess that the computer makes is often based around Part of Speech patterns, or PoS patterns. So, if you hear someone talking about PoS tagging (pronounced "paws"), that's what they're referring to. If you remember middle school English class, then you probably remember sentence diagrams; that's PoS tagging. Companies, people and places are generally referred to with different PoS patterns. There are oftentimes more clues than just PoS patterns, but they are the foundation of named entity recognition in many different systems.

You can also look for any arbitrary list of words, or train a system to look for different kinds of entities (like diseases, or types of trees). The advantage of doing this is that you can associate other things with any entities that are found (such as sentiment or themes). We'll discuss both of these later.

Consider the following example. I've bolded the entities in this article:

Yahoo wants to make its Web e-mail service a place you never want to -- or more importantly -- have to leave to get your social fix. The company on Wednesday is releasing an overhauled version of its Yahoo Mail Beta client that it says is twice as fast as the previous version, while managing to tack on new features like an integrated Twitter client, rich media previews and a more full-featured instant messaging client.

Yahoo says this speed boost should be especially noticeable to users outside the U.S. with latency issues, due mostly to the new version making use of the company's cloud computing technology. This means that if you're on a spotty connection, the app can adjust its behavior to keep pages from timing out, or becoming unresponsive.

http://edition.cnn.com/2010/tech/web/10/27/yahoo.faster.email.cnet/index.html

Entity      Type
Yahoo       Company
Twitter     Company
Facebook    Company
U.S.        Place
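To make the "arbitrary list of words" approach concrete, here is a minimal Python sketch. It only does list lookup plus one simple URL pattern; it is not the PoS-based or statistical recognition described above. The GAZETTEER contents, the URL pattern and the function name extract_entities are all assumptions made up for illustration.

```python
import re

# Hypothetical gazetteer: a hand-made list of names and their types.
GAZETTEER = {
    "Yahoo": "Company",
    "Twitter": "Company",
    "Facebook": "Company",
    "U.S.": "Place",
}

# Simple illustrative pattern for one kind of non-name entity (URLs).
URL_PATTERN = re.compile(r"https?://\S+")


def extract_entities(text):
    """Return a sorted list of (entity, type) pairs found in text."""
    found = set()
    for name, entity_type in GAZETTEER.items():
        # Match the name only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text):
            found.add((name, entity_type))
    for url in URL_PATTERN.findall(text):
        found.add((url, "URL"))
    return sorted(found)
```

Calling extract_entities on a sentence from the article returns the listed names it contains, each paired with its type. A list like this cannot tell Hilton Head from Paris Hilton unless both full names are in it, which is why real systems lean on PoS patterns and other context.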

Classification/Tags: What are the important concepts that are being used?

Classification is the ability to recognize when a piece of text mentions a particular concept. If you want to look for any Tweet that mentions food, or if you want to find any article that discusses politics, you need a classifier. Classifiers have scores associated with them so that you can determine whether an article is truly about food or if it just mentions some sort of food-related concept.

Let's take that same piece of content from above:

Yahoo wants to make its Web e-mail service a place you never want to -- or more importantly -- have to leave to get your social fix. The company on Wednesday is releasing an overhauled version of its Yahoo Mail Beta client that it says is twice as fast as the previous version, while managing to tack on new features like an integrated Twitter client, rich media previews and a more full-featured instant messaging client.

Yahoo says this speed boost should be especially noticeable to users outside the U.S. with latency issues, due mostly to the new version making use of the company's cloud computing technology. This means that if you're on a spotty connection, the app can adjust its behavior to keep pages from timing out, or becoming unresponsive.

Concept                  Score
Software and Internet    0.56
Social Media             0.60
Technology               0.49
Business                 0.72

You can see that the concepts the classifier picks up on are basically what you'd expect them to be, so that you could then easily run analysis of your company and how it was mentioned in the context of, for instance, Technology.
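As a rough illustration of how a classifier can attach a score to each concept, here is a small Python sketch. It scores a concept by the share of its keywords that appear in the text, which is a deliberately crude stand-in and not how the scores in the table above were produced. The keyword lists, the threshold value and the function name classify are assumptions for illustration only.

```python
import re

# Hypothetical keyword lists per concept; a real classifier would be
# configured or trained on far more than this.
CONCEPT_KEYWORDS = {
    "Social Media": {"twitter", "social", "messaging"},
    "Technology": {"cloud", "computing", "latency", "client"},
    "Food": {"recipe", "restaurant", "dinner"},
}


def classify(text, threshold=0.25):
    """Score each concept by the share of its keywords present in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {}
    for concept, keywords in CONCEPT_KEYWORDS.items():
        score = len(keywords & words) / len(keywords)
        # Keep only concepts the text is plausibly "about".
        if score >= threshold:
            scores[concept] = round(score, 2)
    return scores
```

The threshold is what separates an article that is about a concept from one that merely brushes against it. Note that the concepts still have to be set up ahead of time.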

Theme Extraction: What are the important words that are being used?

Themes are the important noun phrases in the text. In other words, Themes tell you exactly the phrases that are being used in a conversation. They differ from Classifiers in that Themes tell you the exact phrases, where Classifiers tell you the broad topics. Themes are really useful for discovery purposes. Say you have a bunch of classifiers set up, but suddenly a new topic of conversation bubbles up. You won't catch that with your classifiers because they have to be configured ahead of time, but Themes will allow you to actually see that there is a new aspect to the conversation that may be important to consider.

Yahoo wants to make its Web e-mail service a place you never want to -- or more importantly -- have to leave to get your social fix. The company on Wednesday is releasing an overhauled version of its Yahoo Mail Beta client that it says is twice as fast as the previous version, while managing to tack on new features like an integrated Twitter client, rich media previews and a more full-featured instant messaging client.

Yahoo says this speed boost should be especially noticeable to users outside the U.S. with latency issues, due mostly to the new version making use of the company's cloud computing technology. This means that if you're on a spotty connection, the app can adjust its behavior to keep pages from timing out, or becoming unresponsive.

Theme                         Score
Cloud computing technology    4.11
E-mail Services               2.672
Top users requests            2.669

This article produces more themes than these top three, but even these top three give a good feel for what this article is about, and when combined with lots of other articles can generate a very good understanding of the most important topics of conversation.
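To give a feel for how themes can surface without being configured ahead of time, here is a simplified Python sketch. It counts two-word phrases that contain no stopwords, which is a plain frequency heuristic swapped in for real noun-phrase extraction (that would normally rely on PoS patterns). The stopword list and the function name candidate_themes are assumptions for illustration, and the counts are not comparable to the theme scores in the table above.

```python
import re
from collections import Counter

# Small hypothetical stopword list; real systems use PoS patterns
# to find noun phrases instead of filtering words like this.
STOPWORDS = {
    "the", "a", "an", "of", "to", "its", "is", "that", "on",
    "and", "in", "with", "for", "it", "this",
}


def candidate_themes(text, top_n=3):
    """Return the most frequent two-word phrases that contain no stopwords."""
    counts = Counter()
    # Split on punctuation so a phrase never spans two clauses.
    for clause in re.split(r"[.!?;,]", text.lower()):
        words = re.findall(r"[a-z][a-z\-]*", clause)
        for first, second in zip(words, words[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                counts[first + " " + second] += 1
    return counts.most_common(top_n)
```

Because nothing here is set up per topic, a phrase that suddenly starts appearing across many documents rises to the top on its own, which is the discovery benefit described above.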

Summarization: What is a shorter form of the document being processed?

Summaries are conceptually simple: just a reduction in the amount of text, so that you can get to the real meat of the article. Summaries are usually not used for analytical purposes; instead they are used to scan through articles to get a general feel for what's going on. They really shine when you apply them to 200-page research reports. It's important to note that you can have different summaries for a single document. In the case of a large analyst report that mentions 20 different companies, you could make a different summary for each company. Then you could aggregate the summaries for 50 different (large) analyst reports, all for that same company, and get a really good feel for what's going on with that company without having to read 10,000 pages of analyst reports.

Sentiment Analysis: How is the conversation occurring? Is the conversation positive or negative for Entities, Themes and the Classifiers/Tags?

Sentiment Analysis is the ability of a machine to tell you whether a conversation is positive or negative, and by how much. Sentiment can be applied at the document level or at the Entity, Theme or Classifier level. It is most useful when applied to specifics, rather than to the document as a whole. For example, if an article is positive for one company but negative for another, you don't want to see that the conversation is neutral on average, but rather that the negative and positive sentiments are appropriately attributed.

With phrase-based sentiment, there will often be phrases that you disagree with in a single article. What is important is that they add up to a score that agrees with your interpretation of reality. The advantage of phrase-based sentiment is that it is completely transparent and allows you to gauge sentiment on a vertical-by-vertical basis without having to retrain a model.

There has been much discussion with respect to accuracy in sentiment. The University of Pittsburgh ran a study that put graduate students through 40 hours of training on how to code for sentiment. They found, after that training, that the students agreed roughly 80% of the time. This revealed that sentiment has a strong personal bias even with rigorous training. So, for machine-based sentiment analysis, 80% agreement with a human rater is pretty much the upper bound. More realistic numbers are typically in the high 60s to mid 70s when looking at a broad content set. If you're looking at a narrow vertical, you can tune sentiment to be more accurate than if you're looking broadly across all industries. It all comes down to the differing use of language in different spaces.
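The idea of phrase-based sentiment attributed to specific entities can be sketched in a few lines of Python. This is a toy under stated assumptions, not any vendor's method: the phrase lexicon and its scores are invented, sentences are split with a naive pattern (it would trip over abbreviations such as "U.S."), and an entity simply inherits the score of each sentence that mentions it. The names PHRASE_SCORES, sentence_score and entity_sentiment are hypothetical.

```python
import re

# Hypothetical phrase lexicon; the scores are illustrative only.
PHRASE_SCORES = {
    "twice as fast": 0.8,
    "speed boost": 0.6,
    "latency issues": -0.5,
    "timing out": -0.6,
    "unresponsive": -0.7,
}


def sentence_score(sentence):
    """Sum the scores of every lexicon phrase that appears in the sentence."""
    lowered = sentence.lower()
    return sum(score for phrase, score in PHRASE_SCORES.items() if phrase in lowered)


def entity_sentiment(text, entities):
    """Average, per entity, the scores of the sentences that mention it."""
    totals = {name: [] for name in entities}
    # Naive split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        score = sentence_score(sentence)
        for name in entities:
            if name.lower() in sentence.lower():
                totals[name].append(score)
    # Entities that are never mentioned are left out of the result.
    return {name: round(sum(s) / len(s), 2) for name, s in totals.items() if s}
```

For a two-sentence text such as "Acme is twice as fast now. Globex keeps timing out and is unresponsive." (both company names made up), the result is positive for Acme and negative for Globex, rather than a single near-neutral document score. Because every contributing phrase is visible in the lexicon, you can see exactly why a score came out the way it did and adjust the phrases for a given vertical.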

Lexalytics uses a phrase-based methodology; other companies use statistical models, but they all have the same goal. Using the previous article (repeated below), positive or negative phrases are highlighted in blue or red, respectively.

Yahoo wants to make its Web e-mail service a place you never want to -- or more importantly -- have to leave to get your social fix. The company on Wednesday is releasing an overhauled version of its Yahoo Mail Beta client that it says is twice as fast as the previous version, while managing to tack on new features like an integrated Twitter client, rich media previews and a more full-featured instant messaging client.

Yahoo says this speed boost should be especially noticeable to users outside the U.S. with latency issues, due mostly to the new version making use of the company's cloud computing technology. This means that if you're on a spotty connection, the app can adjust its behavior to keep pages from timing out, or becoming unresponsive.

As mentioned earlier, the sentiment can be applied to each of the entities, classifiers, and themes; so here they are:

Entity      Sentiment
Yahoo       0.534
Twitter     0.48
Facebook    0.534

Concept                  Score
Software and Internet    0.0
Social Media             0.48
Technology               0.49
Business                 0.49

Theme                         Score
Cloud computing technology    4.3
E-mail Services               3.8
Top users requests            3.5

Summary

In short, text analytics/text mining provides you with the tools necessary to extract meaning from collections of text, telling you who, what, where, when, and how so that you can figure out why it's happening and what you can do to change it or leverage it.

About Angoss Software

As a global leader in predictive analytics, Angoss helps businesses increase sales and profitability, and reduce risk. Angoss helps businesses discover valuable insight and intelligence from their data while providing clear and detailed recommendations on the best and most profitable opportunities to pursue to improve sales, marketing and risk performance. Our suite of desktop, client-server and big data analytics software products and Cloud solutions make predictive analytics accessible and easy to use for technical and business users. Many of the world's leading organizations use Angoss software products and solutions to grow revenue, increase sales productivity and improve marketing effectiveness while reducing risk and cost.

About Lexalytics, Inc.

Lexalytics, Inc. is a software and services company specializing in text and sentiment analysis for social media monitoring, reputation management and entity-level text and sentiment analysis. By enabling organizations to make sense of the vast content repositories on sources like Twitter, blogs, forums, web sites and in-house documents, Lexalytics provides the context necessary for informed critical business decisions. Serving a range of Fortune 500 companies across a wide spectrum, Lexalytics partners with industry leaders such as Endeca, ThomsonReuters, Radian 6 and TripAdvisor to deliver the most effective sentiment and text analysis solutions in the industry.

Corporate Headquarters
111 George Street, Suite 200
Toronto, Ontario M5A 2N4
Canada
Tel: 416-593-1122
Fax: 416-593-5077

European Headquarters
Surrey Technology Centre
40 Occam Road
The Surrey Research Park
Guildford, Surrey GU2 7YG
Tel: +44 (0) 1483-685-770

www.angoss.com

Copyright 2012. Angoss Software Corporation