Projektgruppe. Information Extraction An Incomplete Overview
|
|
|
- Erick Harrell
- 10 years ago
- Views:
Transcription
1 Projektgruppe Henning Wachsmuth Information Extraction An Incomplete Overview 12. Mai
2 Einführungsvorträge Verfassen von Seminarvortrag und paper Prof. Dr. Gregor Engels, Donnerstag 15.4., 16h-18h Interactive Knowledge Semantische CMS-Systeme Fabian Christ, Donnerstag 29.4., 16h-18h Maschinelles Lernen Dr. Theodor Lettmann, Freitag 7.5., 15h-17h Information Extraction Henning Wachsmuth, Mittwoch 12.5., 16h-18h Aktuelle Probleme des Software Engineering Benjamin Nagel, Donnerstag 20.5., 16h-18h Plagiatanalyse Prof. Dr. Benno Stein, Bauhaus-Universität Weimar, Freitag 28.5., 14h-16h Projektmanagement Stefan Sauer, Geschäftsführer s-lab 2
3 Motivation: Automatic market forecasting 3
4 Motivation: Trends and opinions 4
5 Outline What is IE? [1] Introduction, definitions and applications [2] Techniques A common IE pipeline and its algorithms Evaluation [3] Why, what and how to evaluate [4] Outlook Problems, current state and conclusion 5
6 What is Information Extraction (IE)? Information Extraction is a technology that is futuristic from the user's point of view in the current information-driven world. (from the website of the National Institute of Standards and Technology) Some marketing from Open Calais... 6
7 A little more precise... Information Extraction is a technology based on analysing natural language in order to extract snippets of information. (Hamish Cunningham, University of Sheffield) Recovering structured data from formatted text, i.e., identifying fields, understanding relations, normalization and deduplication. (Kamal Nigam, Google Pittsburgh) Filling slots in a database from sub-segments of text. IE = segmentation + classification + association + clustering. (William W. Cohen, Carnegie Mellon University) 7
8 What isn t? Google (search) is not IE Search is Information Retrieval! Next slide... Text understanding is not IE IE is only interested in specific parts of the text IE is usually much more efficient and realistic at present Inferring patterns and information from databases isn t. That s Data Mining. 8
9 Retrieval vs. Extraction Information Retrieval (IR): Index and search Returns relevant documents to a query based on similarity measures and statistics YOU analyze the documents! Advantage: widely applicable, problem: sometimes still too much information Information Extraction (IE): Find and structure Returns facts and specific information of predefined types YOU analyze the facts! Advantage: allows for automatic aggregation, problem: often domain-specific 9
10 Related fields and sciences 10
11 Applications Entity search Faceted search Media monitoring Tourism offers Job Descriptions Trend analysis and opinion mining Market forecasts Health care delivery Monitor terrorist activities... 11
12 Outline revisited What is IE? [1] Introduction, definitions and applications [2] Techniques A common IE pipeline and its algorithms Evaluation [3] Why, what and how to evaluate [4] Outlook Problems, current state and conclusion 12
13 Extraction: example LOEWE AG: VORLÄUFIGE NEUN-MONATS-ZAHLEN 2007 Kronach, 6. November 2007 Auf vorläufiger Basis konnte das Ergebnis des Loewe Konzerns in den ersten neun Monaten deutlich verbessert werden. (...) Beim Umsatz strebt der Chef des Konzerns, Rainer Hecker, für das Jahr 2007 weiterhin ein Wachstum von 10 % auf ca. 380 Mio. Euro an. Loewe AG: T = ( , ) M = (380.0, 10.0) t = ( ) A = ( Rainer Hecker ) Loewe AG: Revenues in million 13
14 Rule-based vs. statistical approaches Knowledge engineering based on language experience Make use of human intuition Only few training data needed Often very time consuming handcrafted development Changes may be hard to accomodate Use Machine Learning techniques Results may be counterintuitive Large amounts of annotated training data needed Main work is feature engineering Changes often easy (but sometimes require reannotation) 14
15 High-level IE pipeline Normal IE tasks may consist of several steps, which are sequentially executed An IE pipeline can look like the following: (Note that the segmentation and naming of steps differ in literature, so take the given pipeline as an example) 15
16 Content Extraction Content Extraction is the process of determining the parts of a document which contain the main textual content Common example: finding the main text of an HTML document For semi-structured documents, approaches to this task mostly rely on analyzing the underlying markup structure Instead of syntactic trees, often only heuristics on the density are used Passages with content often consists of very few tags from spinn3r.com 16
17 Text preprocessing Many Information Extraction algorithms are based on the results of syntactic and linguistic preprocessing, namely: Tokens and sentences Stems, lemmata and part-of-speech of tokens Phrase structure of sentences More on this in the talk of Michael Meier 17
18 Named Entity Recognition (NER) Named entities are nameable objects in the world Traditionally, IE concentrates on persons, organizations, and locations, but many categories are possible (e.g. markets, etc.) NER seeks to find all entities in a text that belong to predefined categories NER is the most popular IE task and has many applications Most state-of-the-art systems use sequential classifiers based on Machine Learning techniques: Die Loewe AG erzielt zum 3. Mal in Folge mehr Umsatz. O B-ORGI-ORG O O O... More on this labeling in the talk of Enes Yigitbas More on NER in the talk of Michael Meier 18
19 Numeric Entity Recognition* Numeric entities can be time and money expressions, quantities, percentages, and the like Sometimes also counted as named entities Some standards distinguish 30 classes of number expressions Numeric Entity Recognition can be done with Sequence Labeling or with rule-based approaches In InfexBA, we found more than 96% of all time and money information with regular expressions Time Money * also often only called Temporal Expression Recognition (TER) 19
20 Coreference Resolution Resolving coreferences is the task of finding and unifying different identifiers for the same referent Anaphoric: relations between pronouns (etc.) and names Proper-noun: different spellings and compoundings of the same object Example from above: LOEWE AG: VORLÄUFIGE NEUN-MONATS-ZAHLEN 2007 Kronach, 6. November 2007 Auf vorläufiger Basis konnte das Ergebnis des Loewe Konzerns in den ersten neun Monaten deutlich verbessert werden. (...) Beim Umsatz strebt der Chef des Konzerns (...) Both hand-crafted and statistical approaches used in literature: Identifying same head words, considering noun phrase types, distance features... 20
21 Relation Extraction and Event Detection Relation Extraction (RE) is on the recognition of relationship between entities Event Detection seeks for the events entities take part in Both tackle the identification of certain relations and classifying roles of entities: Identification can be based on supervised learning (bag of words, etc.) Syntatic parsing may then be used to determine roles Since parsing is quite inefficient, we need to minimize the set of candidate relations: Example: Discard the following sentence, because it s not on revenue! Auf vorläufiger Basis konnte das Ergebnis des Loewe Konzerns in den ersten neun Monaten deutlich verbessert werden. 21
22 Template Filling Compose all information that refers to an instance of the scenario of interest Use entities and relations from the previous stages Maybe find further attributes, properties etc. (with the help of word lists, classifiers, etc.) Example: In InfexBA, we do the following: 1. Extract named entities, time and money information 2. Identify sentences that represent statements on revenue 3. Determine type of statement and the relation between time and money 4. Find subject, scope, statement time, etc. 22
23 Normalization Resolve in-text references Example: Kronach, 6. November 2007 Das Ergebnis des Loewe Konzerns konnte in den ersten neun Monaten deutlich verbessert werden. (...) Transform numeric expressions into a normalized form Example: um 3 Millionen auf 21 Millionen Euro (21.0, 16.67) Master thesis possible! Example: in den ersten neun Monaten 2007 ( , ) Match and unify synonymous entity identifiers Example: Loewe and Loewe AG ( Loewe AG ) 23
24 Aggregation With aggregation, we mean merging, deduplication, and inference on the normalized information Example: Market Forecast Information Chronological merge values, compute average and deviations Infer information on given topic from information on subtopics e.g., from companies to markets Bachelor thesis possible. Tell your friends! 24
25 Visualization Once normalization and aggregation is done, visualization is straightforward Loewe AG Dez 06: ( , ): 345,0 Mio. ( , ): 380,0 Mio Dez 04: ( , ): 310,0 Mio. Aug 02: ( , ): 450,0 Mio. Bachelor thesis possible. Tell your friends! Aug Dez 06 Loewe AG: past/predicted revenues 25
26 Text classification in IE Besides sequential classifiers, the classification of texts may be connected to Information Extraction Techniques like genre analysis or opinion mining are related to IE in that they often include the extraction of specific information: Technology Top Plasma-Fernseher zu günstigem Preis. Geringer Umfang zwar, aber Bildqualität hervorragend. Da kann man nicht meckern. positive? negative 26
27 Outline, the 3rd What is IE? [1] Introduction, definitions and applications [2] Techniques A common IE pipeline and its algorithms Evaluation [3] Why, what and how to evaluate [4] Outlook Problems, current state and conclusion 27
28 Evaluation Discover points of failure Prove that your ideas are good Why evaluate? Decide between alternative methods Tune / Optimize your system 28
29 How to evaluate? Some recommendations... Clarify what you want to show: efficiency vs. exactness vs. completeness vs. the best algorithm vs. proof of concept vs.... Can you make your results comparable? Design your experiments in a scientific way: Use test sets different from your training data, and make explicit what you ve done! Could someone else reproduce your results from your documentation? How many variables are there?... Decide how to measure: performance measures, correctness criteria, scope of evaluation (e.g. single vs. overall algorithms)... Often standard measures such as precision and recall are used (next slide...) What is a match (exact vs. overlap)? Is there only one correct text span? 29
30 Performance measures: Precision and Recall False Positives True Positives True Negatives Precision (P) = TP / (TP + FP) False Negatives Recall (R) = = TP / (TP + FN) F-Measure = (2 P R) / (P+R) Precision is a measure for the exactness of your algorithm: How many of the found objects are correct? Recall is a measure for the completeness of your algorithm: How many of the target objects have been found? 30
31 Evaluation: example 200 target objects 500 objects 95 found correctly 100 found Precision = 95 / (95+5) = 0.95 Recall = 95 / (95+105) = Confusion Matrix F-Measure = ( ) / ( ) = 0.62 Target Other Found 95 (TP) 5 (FP) Not found 105 (FN) 295 (TN) 31
32 What to do with your results? Analyze your false positives and false negatives Can you avoid some false extractions or classifications and can you find some of the missed true positives? In consequence, introduce better features/rules and redo experiments In case results are new, innovative, best ever seen, scientifically important, or whatever: publish them ;-) Choose the best of your alternatives In theory, the one that fits best to the problem at hand In practice, the best-performing algorithm even if it s simple 32
33 State-of-the-art State-of-the-art NER systems achieve near-human performance: F-Measure > 93% (humans ~ 96%) For English only! In German: F-Measure < 80%... WHY? Your task: Improve German state-of-the-art ;-) It s all about feature engineering... Coreference Resolution: Results very domain-dependent, in hard cases only 50%-60% may be relied upon Relation Extraction: Good Systems have F-Measure > 75% Template Filling: Systems score around 60% while humans achieve 80%+ 33
34 Error propagation Imagine you have a pipeline with 7 algorithms Each has 95% F-Measure and is only based on the preceding one What is your overall F-Measure? In the worst case: F-Measure = = 0,698 In practice, things are often not that bad. The problem is obvious, though: Market Statement Identification Precision Recall 0,95 0,9 0,85 0,8 On annotated data On extracted data 34
35 Outline again What is IE? [1] Introduction, definitions and applications [2] Techniques A common IE pipeline and its algorithms Evaluation [3] Why, what and how to evaluate [4] Outlook Problems, current state and conclusion 35
36 The big problems Limited accuracy is actually often not that a big problem: Even if results are not always accurate, they can be valuable if linked back to the original text (Diana Maynard, University of Sheffield) However, we know how to work on improvements of accuracy The need for training data is a problem of time and resources and leads to the really big problems: Domain dependency Language dependency 36
37 Domain dependency Most IE techniques are more or less domain-dependent! Example from opinion mining: READ THE BOOK! Positive statement on a book? Positive statement on a movie? General problems refer to style of writing, specific scenarios of interest, etc. We need annotated text corpora for training and evaluation from the target domain Collection of documents representative for the domain Manually annotated by domain experts (for the project group: YOU) Alternative: research on domain adaptation, but evaluation problem remains 37
38 Language dependency Many rules or features tend to rely on language-specific properties even if no word lists are used German: capitalization, words are spaced, etc., English: no capitalization Japanese: words aren t even spaced Adaptation to new languages is the focus of much current work Called multilingual or language-independent IE Needed: language-independent features, bilingual dictionaries, annotated or at least unannotated data, support for non-latin scripts, different encodings,... Maybe language adaptation is just a special case of domain-adaptation? However, this is not the focus of the project group! 38
39 Frameworks, toolkits and the like 39
40 What you should have learned... IE is about extracting specific information from natural language text Uses linguistic and statistical phenonema IE is hot for the current industry A common IE pipeline consists of several text analysis algorithms such as Entity Recognition and Relation Extraction Normalization and aggregation Evaluation is necessary to show that your algorithms work IE is often domain-specific and thus needs the creation of and the work with annotated text corpora 40
41 Thanks for your attention! Questions? 41
Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set
Overview Evaluation Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification
Projektgruppe. Categorization of text documents via classification
Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
Interactive Dynamic Information Extraction
Interactive Dynamic Information Extraction Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann, and Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, 66123 Saarbrücken
An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models
Dissertation (Ph.D. Thesis) An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models Christian Siefkes Disputationen: 16th February
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
Extraction and Visualization of Protein-Protein Interactions from PubMed
Extraction and Visualization of Protein-Protein Interactions from PubMed Ulf Leser Knowledge Management in Bioinformatics Humboldt-Universität Berlin Finding Relevant Knowledge Find information about Much
Machine Learning for natural language processing
Machine Learning for natural language processing Introduction Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 13 Introduction Goal of machine learning: Automatically learn how to
Clustering Connectionist and Statistical Language Processing
Clustering Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised
2014/02/13 Sphinx Lunch
2014/02/13 Sphinx Lunch Best Student Paper Award @ 2013 IEEE Workshop on Automatic Speech Recognition and Understanding Dec. 9-12, 2013 Unsupervised Induction and Filling of Semantic Slot for Spoken Dialogue
Statistical Machine Translation
Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language
Search Engines Chapter 2 Architecture. 14.4.2011 Felix Naumann
Search Engines Chapter 2 Architecture 14.4.2011 Felix Naumann Overview 2 Basic Building Blocks Indexing Text Acquisition Text Transformation Index Creation Querying User Interaction Ranking Evaluation
Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r
Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures ~ Spring~r Table of Contents 1. Introduction.. 1 1.1. What is the World Wide Web? 1 1.2. ABrief History of the Web
So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati [email protected], [email protected]
POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition
POSBIOTM-NER: A Machine Learning Approach for Bio-Named Entity Recognition Yu Song, Eunji Yi, Eunju Kim, Gary Geunbae Lee, Department of CSE, POSTECH, Pohang, Korea 790-784 Soo-Jun Park Bioinformatics
How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.
Svetlana Sokolova President and CEO of PROMT, PhD. How the Computer Translates Machine translation is a special field of computer application where almost everyone believes that he/she is a specialist.
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System Athira P. M., Sreeja M. and P. C. Reghuraj Department of Computer Science and Engineering, Government Engineering
11-792 Software Engineering EMR Project Report
11-792 Software Engineering EMR Project Report Team Members Phani Gadde Anika Gupta Ting-Hao (Kenneth) Huang Chetan Thayur Suyoun Kim Vision Our aim is to build an intelligent system which is capable of
Evaluation & Validation: Credibility: Evaluating what has been learned
Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model
Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov
Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or
Building a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
SVM Based Learning System For Information Extraction
SVM Based Learning System For Information Extraction Yaoyong Li, Kalina Bontcheva, and Hamish Cunningham Department of Computer Science, The University of Sheffield, Sheffield, S1 4DP, UK {yaoyong,kalina,hamish}@dcs.shef.ac.uk
Named Entity Recognition Experiments on Turkish Texts
Named Entity Recognition Experiments on Dilek Küçük 1 and Adnan Yazıcı 2 1 TÜBİTAK - Uzay Institute, Ankara - Turkey [email protected] 2 Dept. of Computer Engineering, METU, Ankara - Turkey
Semantic annotation of requirements for automatic UML class diagram generation
www.ijcsi.org 259 Semantic annotation of requirements for automatic UML class diagram generation Soumaya Amdouni 1, Wahiba Ben Abdessalem Karaa 2 and Sondes Bouabid 3 1 University of tunis High Institute
Survey Results: Requirements and Use Cases for Linguistic Linked Data
Survey Results: Requirements and Use Cases for Linguistic Linked Data 1 Introduction This survey was conducted by the FP7 Project LIDER (http://www.lider-project.eu/) as input into the W3C Community Group
Open Domain Information Extraction. Günter Neumann, DFKI, 2012
Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for
Introduction to Text Mining. Module 2: Information Extraction in GATE
Introduction to Text Mining Module 2: Information Extraction in GATE The University of Sheffield, 1995-2013 This work is licenced under the Creative Commons Attribution-NonCommercial-ShareAlike Licence
Automated Extraction of Security Policies from Natural-Language Software Documents
Automated Extraction of Security Policies from Natural-Language Software Documents Xusheng Xiao 1 Amit Paradkar 2 Suresh Thummalapenta 3 Tao Xie 1 1 Dept. of Computer Science, North Carolina State University,
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS
Applications of speech-to-text in customer service. Dr. Joachim Stegmann Deutsche Telekom AG, Laboratories
Applications of speech-to-text in customer service. Dr. Joachim Stegmann Deutsche Telekom AG, Laboratories Contents. 1. Motivation 2. Scenarios 2.1 Voice box / call-back 2.2 Quality management 3. Technology
Collecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Natural Language to Relational Query by Using Parsing Compiler
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
Web Database Integration
Web Database Integration Wei Liu School of Information Renmin University of China Beijing, 100872, China [email protected] Xiaofeng Meng School of Information Renmin University of China Beijing, 100872,
Visualization methods for patent data
Visualization methods for patent data Treparel 2013 Dr. Anton Heijs (CTO & Founder) Delft, The Netherlands Introduction Treparel can provide advanced visualizations for patent data. This document describes
Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy
The Deep Web: Surfacing Hidden Value Michael K. Bergman Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy Presented by Mat Kelly CS895 Web-based Information Retrieval
Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words
, pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan
Why are Organizations Interested?
SAS Text Analytics Mary-Elizabeth ( M-E ) Eddlestone SAS Customer Loyalty [email protected] +1 (607) 256-7929 Why are Organizations Interested? Text Analytics 2009: User Perspectives on Solutions
Parsing Software Requirements with an Ontology-based Semantic Role Labeler
Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh [email protected] Ewan Klein University of Edinburgh [email protected] Abstract Software
NetOwl(TM) Extractor Technical Overview March 1997
NetOwl(TM) Extractor Technical Overview March 1997 1 Overview NetOwl Extractor is an automatic indexing system that finds and classifies key phrases in text, such as personal names, corporate names, place
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
Language Interface for an XML. Constructing a Generic Natural. Database. Rohit Paravastu
Constructing a Generic Natural Language Interface for an XML Database Rohit Paravastu Motivation Ability to communicate with a database in natural language regarded as the ultimate goal for DB query interfaces
Financial Trading System using Combination of Textual and Numerical Data
Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,
Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.
Data Mining on Social Networks Dionysios Sotiropoulos Ph.D. 1 Contents What are Social Media? Mathematical Representation of Social Networks Fundamental Data Mining Concepts Data Mining Tasks on Digital
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Stanford s Distantly-Supervised Slot-Filling System
Stanford s Distantly-Supervised Slot-Filling System Mihai Surdeanu, Sonal Gupta, John Bauer, David McClosky, Angel X. Chang, Valentin I. Spitkovsky, Christopher D. Manning Computer Science Department,
Final Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
OUTLIER ANALYSIS. Data Mining 1
OUTLIER ANALYSIS Data Mining 1 What Are Outliers? Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism Ex.: Unusual credit card purchase,
INTRODUCING AZURE SEARCH
David Chappell INTRODUCING AZURE SEARCH Sponsored by Microsoft Corporation Copyright 2015 Chappell & Associates Contents Understanding Azure Search... 3 What Azure Search Provides...3 What s Required to
Coffee Break German Lesson 06
LESSON NOTES WIE VIEL KOSTET DAS? In this episode of Coffee Break German we ll start by learning the numbers from zero to ten and then learn to deal with transactional situations involving paying for things
Segmentation and Classification of Online Chats
Segmentation and Classification of Online Chats Justin Weisz Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 [email protected] Abstract One method for analyzing textual chat
Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis
Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Yue Dai, Ernest Arendarenko, Tuomo Kakkonen, Ding Liao School of Computing University of Eastern Finland {yvedai,
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive
Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control
Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control Andre BERGMANN Salzgitter Mannesmann Forschung GmbH; Duisburg, Germany Phone: +49 203 9993154, Fax: +49 203 9993234;
Index Contents Page No. Introduction . Data Mining & Knowledge Discovery
Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.
Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources
Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources Michelle
Mining the Software Change Repository of a Legacy Telephony System
Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,
A Method for Automatic De-identification of Medical Records
A Method for Automatic De-identification of Medical Records Arya Tafvizi MIT CSAIL Cambridge, MA 0239, USA [email protected] Maciej Pacula MIT CSAIL Cambridge, MA 0239, USA [email protected] Abstract
IBM SPSS Modeler Premium
IBM SPSS Modeler Premium Improve model accuracy with structured and unstructured data, entity analytics and social network analysis Highlights Solve business problems faster with analytical techniques
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Term extraction for user profiling: evaluation by the user
Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,
Search Result Optimization using Annotators
Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,
WHITEPAPER. Text Analytics Beginner s Guide
WHITEPAPER Text Analytics Beginner s Guide What is Text Analytics? Text Analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content
Natural Language Database Interface for the Community Based Monitoring System *
Natural Language Database Interface for the Community Based Monitoring System * Krissanne Kaye Garcia, Ma. Angelica Lumain, Jose Antonio Wong, Jhovee Gerard Yap, Charibeth Cheng De La Salle University
Web Data Extraction: 1 o Semestre 2007/2008
Web Data : Given Slides baseados nos slides oficiais do livro Web Data Mining c Bing Liu, Springer, December, 2006. Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008
Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari [email protected]
Web Mining Margherita Berardi LACAM Dipartimento di Informatica Università degli Studi di Bari [email protected] Bari, 24 Aprile 2003 Overview Introduction Knowledge discovery from text (Web Content
Applying Data Science to Sales Pipelines for Fun and Profit
Applying Data Science to Sales Pipelines for Fun and Profit Andy Twigg, CTO, C9 @lambdatwigg Abstract Machine learning is now routinely applied to many areas of industry. At C9, we apply machine learning
CHAPTER 1 INTRODUCTION
CHAPTER 1 INTRODUCTION 1. Introduction 1.1 Data Warehouse In the 1990's as organizations of scale began to need more timely data for their business, they found that traditional information systems technology
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
STATISTICA. Financial Institutions. Case Study: Credit Scoring. and
Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT
Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization
Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization Atika Mustafa, Ali Akbar, and Ahmer Sultan National University of Computer and Emerging
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
Bayesian Spam Filtering
Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary [email protected] http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating
W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
Big Data Text Mining and Visualization. Anton Heijs
Copyright 2007 by Treparel Information Solutions BV. This report nor any part of it may be copied, circulated, quoted without prior written approval from Treparel7 Treparel Information Solutions BV Delftechpark
Removing Web Spam Links from Search Engine Results
Removing Web Spam Links from Search Engine Results Manuel EGELE [email protected], 1 Overview Search Engine Optimization and definition of web spam Motivation Approach Inferring importance of features
Microsoft FAST Search Server 2010 for SharePoint Evaluation Guide
Microsoft FAST Search Server 2010 for SharePoint Evaluation Guide 1 www.microsoft.com/sharepoint The information contained in this document represents the current view of Microsoft Corporation on the issues
How To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery
Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery Jan Paralic, Peter Smatana Technical University of Kosice, Slovakia Center for
How To Use Neural Networks In Data Mining
International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and
How To Write A Summary Of A Review
PRODUCT REVIEW RANKING SUMMARIZATION N.P.Vadivukkarasi, Research Scholar, Department of Computer Science, Kongu Arts and Science College, Erode. Dr. B. Jayanthi M.C.A., M.Phil., Ph.D., Associate Professor,
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum
Statistical Validation and Data Analytics in ediscovery Jesse Kornblum Administrivia Silence your mobile Interactive talk Please ask questions 2 Outline Introduction Big Questions What Makes Things Similar?
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
Data Mining Solutions for the Business Environment
Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania [email protected] Over
Customizing an English-Korean Machine Translation System for Patent Translation *
Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,
