Text Mining and its Applications to Intelligence, CRM and Knowledge Management
Editor: A. Zanasi
TEMIS Text Mining Solutions S.A., Italy
WIT Press, Southampton, Boston
Contents

Bibliographies xix
Preface Text Mining: a new technology paradigm? xxvii

Part 1: THEORETICAL OVERVIEW

Chapter 1: Text processing and information retrieval
M. Milic-Frayling 1
1 Introduction 1
2 Data gathering and extraction of text 3
2.1 Encoding of textual information 3
2.2 Document formats and markup languages 4
2.3 Data collection in distributed hyperlinked environments 4
2.3.1 Web crawling 5
2.3.2 Distributed architecture - 'Harvest' 6
3 Text processing 6
3.1 From grammar rules to statistical NLP 6
3.2 Basic linguistic concepts 8
3.3 Text processing 11
3.3.1 Word segmentation 11
3.3.2 Word conflation and disambiguation 12
3.3.3 Part of speech tagging 15
3.3.4 Parsing 16
4 Information retrieval 16
4.1 Basic concepts 18
4.2 Indexing 20
4.2.1 Feature selection 22
4.2.2 Feature weighting 23
4.3 Retrieval models 23
4.3.1 Boolean retrieval 24
4.3.2 Vector space retrieval 25
4.3.3 Probabilistic information retrieval 27
4.4 Search refinement with user relevance feedback 28
4.5 Evaluation 29
4.5.1 Evaluation measures 29
4.5.2 TREC - Large scale evaluation initiatives 31
5 Concluding remarks 39
Chapter 2: Information extraction and... Surroundings
M.T. Pazienza 47
1 Introduction 47
1.1 What is information retrieval (IR)? 48
1.2 What is information extraction? 49
1.3 What is full text understanding? 49
1.4 What is mining? 49
1.5 Is there a difference between information extraction from web documents and traditional ones? 50
2 Information extraction historical flash-back 51
3 IE systems architecture 53
3.1 Text zoning 54
3.2 Pre-processor 54
3.3 Filter 54
3.4 Pre-parser 54
3.5 Parser 54
3.6 Fragment combination 54
3.7 Semantic interpretation 54
3.8 Lexical disambiguation 55
3.9 Co-reference resolution, discourse processing 55
3.10 Template generation 55
4 Features of an IE system 55
4.1 The parsing role 61
4.2 The scenario's role 62
5 Adaptive IE systems 63
5.1 Machine-learning for information extraction 64
6 IE systems: a few European applications 64
6.1 NAMIC 65
6.1.1 Large scale IE for automatic authoring 65
6.1.2 The role of a world model as a method for event matching and co-referencing 65
6.1.3 Named entity matcher 66
6.1.4 Discourse processor 66
6.1.5 Ontological and lexical information 66
6.1.6 The NAMIC architecture 66
6.2 CROSSMARC 68
6.2.1 The CROSSMARC IE component 68
6.2.2 Cross-lingual named entity recognition and classification 68
6.2.3 CROSSMARC fact extraction 69
6.2.4 CROSSMARC fact definition 70
6.2.5 CROSSMARC ontology 70

Chapter 3: Text clustering as a mining task
F. Mandreoli, R. Martoglia & P. Tiberio 75
1 Introduction 75
2 Overview on data clustering analysis 78
2.1 Similarity measures 78
2.2 Clustering techniques 79
2.2.1 Single-link and complete-link hierarchical methods 81
2.2.2 K-means partitional methods 82
3 Problems and solutions in the text clustering field 84
3.1 Effective extraction of meaningful features from plain texts 84
3.2 Effective treatment of high dimensionality 84
3.3 Interpretability of results 84
3.4 Efficiency and scalability of the clustering process 84
3.5 Feature selection and reduction 85
3.6 Efficient clustering of large unstructured data sets 88
3.6.1 K-Means clustering variants 88
3.6.2 Relational data analysis (RDA) clustering 91
3.6.3 New clustering approaches from the DataBase community 92
3.6.4 Document-specific clustering approaches 95
3.6.5 A short note about memory management and distribution 96
3.6.6 Comprehension and navigation of clustering output 98
3.7 Clustering web documents 99
4 Conclusions 104

Chapter 4: Text categorization
F. Sebastiani 109
1 Introduction 109
2 The basic picture 110
2.1 Document indexing 112
2.2 Classifier learning 113
2.3 Classifier evaluation 113
3 Techniques 115
3.1 Document indexing techniques 115
3.2 Classifier learning techniques 116
3.2.1 Support vector machines 116
3.2.2 Boosting 117
4 Applications 117
4.1 Automatic indexing for Boolean information retrieval systems 118
4.2 Document organization 118
4.3 Text filtering 119
4.4 Hierarchical categorization of web pages 119
4.5 Word sense disambiguation 120
4.6 Automated survey coding 120
4.7 Automated authorship attribution and genre classification 120
4.8 Spam filtering 121
4.9 Other applications 122
5 Conclusion 122
6 Notes 123
Chapter 5: Summarization and visualization
D. Mladenic & M. Grobelnik 131
1 Introduction 131
2 Text summarization 132
2.1 Keywords 133
2.1.1 Extracting keywords from text 133
2.1.2 Keyword assignment using document categorization 134
2.2 Sentence extraction 135
2.3 Abstract generation 137
3 Text visualization 138
4 Example of summarization of a document set 139
5 Future directions of research and applications 140

Part 2: APPLICATIONS

Chapter 6: Application integration in applied text mining
D. Sullivan 145
1 Introduction 145
2 Business drivers and application types 146
2.1 Customer transaction analysis 146
2.2 Competitive intelligence 147
2.3 Research and development support 147
3 Application elements 148
3.1 Content acquisition 148
3.1.1 Internal content acquisition 149
3.1.2 External content source 149
3.1.3 Rights management 149
3.2 Pre-processing 150
3.3 Linguistic analysis 151
3.3.1 Term co-occurrence 151
3.3.2 Entity extraction 151
3.3.3 Information extraction 151
3.4 User analysis 152
3.5 Content repository 152
3.6 Security and access controls 153
4 Conclusions 153

Chapter 7: ROI in text mining projects
M. Ferrari 155
1 Introduction 155
2 The evaluation of a text mining solution 157
3 The evaluation of the tangible components 158
3.1 The evaluation of the tangible components in a text mining project: an example 160
4 The evaluation of the intangible components 163
4.1 The value chain Scoreboard 164
4.2 The Intangible Asset Monitor and The Skandia Navigator 167
4.3 The balanced scorecard 171
5 Conclusions 177

a) INTELLIGENCE

Chapter 8: Open sources automatic analysis for corporate and government intelligence
A. Zanasi 185
1 Introduction 185
2 New government intelligence role 186
2.1 New challenges to the market state 186
2.2 New intelligence cycle 186
2.3 A help from corporate intelligence 187
3 Corporate intelligence 187
3.1 Competitive intelligence definitions 187
3.2 CI questions 187
3.3 Where are the answers? 188
4 Open sources 188
4.1 Definition 188
4.2 Internet data 189
4.3 Hosts 189
4.4 Online databanks 189
4.5 Proprietary sources 190
4.6 The open sources analysis problems 190
4.7 Other needs: forecasting and early warning systems 191
5 Terrorism and other challenges to government intelligence 192
5.1 Introduction 192
5.2 Homeland security: DARPA and HSARPA vision 192
5.2.1 Information, an arm against the asymmetric threats 193
5.2.2 EELD program 193
5.2.3 TIDES program 193
5.3 Anti-terrorism tasks requiring analysis of a large quantity of text 193
5.3.1 High tech terrorism 193
5.3.2 Names and relationships detection 194
5.3.3 Vindications analysis 194
5.3.4 Info spam 194
5.3.5 Identifying lobbying 194
5.3.6 Monitoring a specific market sector 194
5.3.7 Money laundering 194
5.3.8 Insider trading 194
6 Practical examples of text mining applied to the intelligence process 196
6.1 Characteristics of high quality intelligence 196
6.2 The modules to implement the intelligence process 196
6.2.1 Business discovery 197
6.2.2 Solution definition 197
6.2.3 Research strategy 197
6.2.4 Analysis 198
6.2.5 Results analysis and interpretation 199
6.3 The reachable objectives 203
7 Business cases 203
7.1 Forecasting competitor actions 203
7.2 Detecting competitor action in market 203
7.3 Alliances detection 203
7.4 Business opportunities detection 203
7.5 Predicting biowarfare agents 206
7.6 Supply chain management and purchasing activity 206
7.7 Military strategy 206
7.8 Extracting terminologies 207
7.9 R&D activity detection 207

Chapter 9: A critical appraisal of text mining in an intelligence environment
A. Politi 209
1 Introduction 209
2 11 Sept., intelligence and information explosion 209
3 Data mining: some world relevant examples 211
4 Data mining, the intelligence cycle and decision 214

Chapter 10: Marketing intelligence system to forecast telecommunications competitive landscape
S. de'Rossi 219
1 Introduction 219
2 Italian mobile market overview 220
3 TIM positioning 220
4 From competitive to market intelligence 221
5 Our needs 222
5.1 The business model 222
5.2 The intelligence needs 223
5.3 The information source 223
5.4 Building up the system 224

Chapter 11: Competitive intelligence for SMEs: An application to the Italian building sector
G. Casoni 227
1 What was the problem 227
2 Edilintelligence: what is it? 229
3 The text mining bricks of the solution: Theory and practice 231
4 Conclusions 234
b) CRM

Chapter 12: Virtual communities: human capital and other personal characteristics extraction
A. Zanasi 237
1 The emergence of neo-renaissance paradigm 237
2 Intellectual and human capital 238
2.1 The real wealth 238
2.2 Intellectual capital taxonomy 238
3 Virtual communities: where text mining is applied 239
3.1 Community structuring 239
3.2 Participant interaction and access 240
3.3 Content management 240
3.4 Community leveraging 240
4 Human capital in customer communities 241
5 Human capital in employee community 241
5.1 An example of human capital: Employee attitudes 241
5.2 Vital signs monitor 242
5.3 VSM key concepts definition 243
6 Human capital in social contexts 244
6.1 Defining anonymous terrorist authorship 244
6.2 Digital signatures 244
6.3 Lobby detection 245
6.5 Monitoring of specific areas/sectors 245
6.6 Chatlines and other open sources analysis 245
7 Social network links detection 246
7.1 Social structure 246
7.2 Graphical representation of connections 246

Chapter 13: Customer feedbacks and opinion surveys analysis in the automotive industry
L. Grivel 249
1 Introduction 249
2 Customer feedback analysis in Renault 250
2.1 Objectives and problem description 250
2.2 Analysis 251
2.3 The technology 252
2.3.1 Information extraction 252
2.3.2 Skill cartridges 252
2.3.3 An ontology for CRM applications 252
2.4 Solution 252
2.5 Feedback 253
3 Opinion surveys for automotive manufacturers 253
3.1 Objectives and problem description 253
3.2 Analysis 253
3.3 Technology 254
3.3.1 General dictionaries and rules 254
3.3.2 Automotive specific dictionaries and rules 255
3.4 Implemented solution 255
3.5 Feedback 255
4 Conclusions 257

Chapter 14: The Responsio email management system
M. Kockelkorn & T. Scheffer 259
1 Introduction 259
2 Email answering by semi-supervised text classification 260
3 Responsio email management system 261
4 Case study 262
5 Discussion 263

Chapter 15: TV channel provider: mining the user feedback
L.K. Wives, S. Loh, J.L. Duizith & J.P. Moreira de Oliveira 265
1 Introduction 265
2 The case 265
3 The process 266
4 Conclusion 268

c) KNOWLEDGE MANAGEMENT

Chapter 16: Text mining based knowledge management in banking
K. Lebeth, M. Lorenz & U. Storl 271
1 Introduction 271
2 The document as a primary source 272
3 Knowledge based search 272
4 Building up a knowledge management infrastructure 272
5 Integrating principles 273
6 Modules 274
6.1 Term extractor 274
6.2 Knowledge Net 274
6.3 Automated metatagging engine 276
7 Conclusion and future work 277

Chapter 17: Text mining in life sciences
J. Fluck, H. Deneke & C. Gieger 279
1 Introduction 279
2 Text mining - current state 280
2.1 Methodical development 280
2.2 Applications in life sciences 281
3 Ontology development 282
4 Conclusion 283
Chapter 18: Information search and classification to foster innovation in SMEs: The AREA Science Park experience
F. Neri 285
1 The AREA Science Park and its technology transfer division 286
2 TEMIS online miner light, the TTD search engine for patents (TTDSE) 287
2.1 Data selection 287
2.2 TTDSE back-end: the knowledge extractor 287
2.3 TTDSE front-end: the advanced search engine 288
3 TTD results 289

Chapter 19: Media industry: how to improve documentalists efficiency
G. Peters 293
1 Introduction 293
2 Text data production in media 293
3 Indexing textual data 294
4 Archive solutions: data bases and automatic procedures 295
5 Text mining experience in Gruner + Jahr 295
5.1 Overview 295
5.2 The need 296
5.3 Performance measures 296
5.4 Extraction 296
5.4.1 Personal names 296
5.4.2 Organization names 296
5.5 Utilization: lessons learnt 297
5.5.1 Customization 297
5.5.2 Training 297
5.5.3 Documentalists 297
5.5.4 Savings 297
6 Conclusion 297

Chapter 20: Link analysis in crime pattern detection
S. Ananyan 299
1 Introduction 299
2 Case overview 299
3 Implementation approach 300
4 Data preprocessing 301
5 Structured data analysis 302
6 Concept extraction 302
7 Pattern analysis 304
8 Drill-down and reporting 304
9 Drill-down and reporting 307
10 Automation 313
11 Conclusion 313
Part 3: SOFTWARE

Chapter 21: Text mining tools
A. Zanasi 315
1 Megaputer intelligence 315
1.2 Company description 315
1.3 Products 315
1.4 Incorporation of domain knowledge 316
1.5 Exporting discovered knowledge 316
1.6 Supported languages 316
1.7 IT requirements 317
1.8 Customer base 317
1.9 Partners 317
1.10 Supported applications 317
2 SAS 317
2.1 Company description 317
2.2 Product 318
2.3 Incorporation of domain knowledge 318
2.4 Supported languages 318
2.5 IT requirements 318
3 SPSS 319
3.1 Company description 319
3.2 Product functionality 319
3.3 Incorporation of domain knowledge 319
3.4 Exporting discovered knowledge 319
3.5 Supported languages 319
3.6 IT requirements 320
3.7 Marketing information 320
3.7.1 LexiQuest Mine a text mining application 320
3.7.2 LexiQuest Categorize a categorization engine 320
3.7.3 Text Mining for Clementine an add-on to the data mining suite 320
3.8 Customer base 320
4 Synthema 321
4.1 Company description 321
4.2 Product 321
4.3 Incorporation of domain knowledge 321
4.4 Exporting discovered knowledge 321
4.5 Supported languages 321
4.6 IT requirements 321
4.7 Customer base 322
5 TEMIS 322
5.1 Company description 322
5.2 Products 322
5.2.1 Insight discoverer clusterer (IDC) 322
5.2.2 Insight discoverer categorizer (IDK) 323
5.2.3 Insight discoverer extractor (IDE) and skills cartridges (SC) 323
5.2.4 Online miner (OM) 324
5.2.5 Xelda 324
5.3 Incorporation of domain knowledge 325
5.4 Supported languages 325
5.5 IT requirements 325
5.6 Customer base 325
5.7 Partners 325
6 Others 325
6.1 Autonomy 325
6.2 Clearforest 326
6.3 Convera 326
6.4 Entrieva 326
6.5 Fast 326
6.6 IBM 326
6.7 Insightful 326
6.8 Inxight 327
6.9 Verity 327