Joint Research Centre




Joint Research Centre
Open Source Monitoring Tools and Applications
emm.newsbrief.eu
Serving society. Stimulating innovation. Supporting legislation.

Open Source Monitoring - Overview
EMM Introduction
Custom Domains
Processing Features
End User Applications
Collaboration Spotting

EMM Architecture

(Definition) Social Media* WWW Blogs

(Definition)
Source types: WWW, blogs, social media*
Sources: 4000 news sites
Input: 175000 articles per day
Languages: 70
Categories: 1000 classes, 30000 keywords
Visitors: 25000
Runs 24/7
Developed, built and maintained by JRC

EMM Categories
Powerful classification engine
Based on user-defined keywords/patterns
Allows boolean combinations, proximity and wildcards
Support for Arabic and similar languages (automatic pronoun prefixing)
Support for Chinese and similar languages (no whitespace)
Categories can overlap; there is no ontology
Multilingual
Categories are defined for: countries, themes, EC institutions and agencies, policy areas, Commissioners, diseases and many more
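The keyword/pattern matching described above can be illustrated with a minimal sketch. It is not the EMM engine: the Category class, its field names and the '*' wildcard syntax are assumptions made for this example, and the special handling for Arabic prefixes or Chinese text without whitespace is not covered.

```python
import re
from dataclasses import dataclass, field


def positions(tokens, pattern):
    """Return the indices of tokens matching a keyword; '*' is a wildcard (assumed syntax)."""
    rx = re.compile(re.escape(pattern.lower()).replace(r"\*", r"\w*"))
    return [i for i, tok in enumerate(tokens) if rx.fullmatch(tok)]


def near(tokens, a, b, max_distance):
    """Proximity test: some match of a lies within max_distance tokens of some match of b."""
    return any(abs(i - j) <= max_distance
               for i in positions(tokens, a)
               for j in positions(tokens, b))


@dataclass
class Category:
    """Hypothetical category definition: a boolean combination of keyword patterns."""
    name: str
    any_of: list = field(default_factory=list)     # OR: at least one must occur
    none_of: list = field(default_factory=list)    # NOT: none may occur
    proximity: list = field(default_factory=list)  # AND: (a, b, max_distance) triples

    def matches(self, text):
        tokens = re.findall(r"\w+", text.lower())
        if any(positions(tokens, p) for p in self.none_of):
            return False
        if self.any_of and not any(positions(tokens, p) for p in self.any_of):
            return False
        return all(near(tokens, a, b, d) for a, b, d in self.proximity)


def classify(text, categories):
    """Categories may overlap, so an article can receive several labels."""
    return [c.name for c in categories if c.matches(text)]


CATEGORIES = [
    Category("Floods", any_of=["flood*", "inundat*"], none_of=["market"]),
    Category("Evacuation", proximity=[("evacuat*", "resident*", 5)]),
]

if __name__ == "__main__":
    print(classify("Flooding forced officials to evacuate hundreds of residents.", CATEGORIES))
```

Because each category is tested independently, one article can fall into several categories at once, which mirrors the overlapping, ontology-free design on the slide.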

Example of 1 category

Automated Entity Extraction: 500,000 persons and organizations, based on a continuously updated list of entities with many language-specific synonyms
Geo Tagger: multilingual gazetteer of over 1.5 million entries (growing)
Powerful Categorisation Engine (a.k.a. Alerts)
Tonality / Sentiment Detection
Duplicate Detection
Quote Extraction
Event Extraction
Automatic Language Detection
Meta-Data Filtering
Clustering and Story Tracking
Alerting System (SMS, email)
Index of all text and metadata (Search)
Statistical Analyser
RSS/KML Services for all extracted information
In-line Statistical Machine Translation
Multi-document summarization
Supported languages: ar, bg, da, de, en, es, et, fr, it, nl, no, pl, pt, ro, ru, sl, sv, sw, tr
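List-based entity extraction with language-specific synonyms can be sketched as a lookup from surface forms to one canonical entity. This is only an illustration under assumptions: the ENTITIES table, the function names and the sample sentence are invented for the example, and the slides do not say how EMM implements the matching.

```python
import re

# Hypothetical entity list: canonical name -> known surface forms in several languages.
ENTITIES = {
    "European Commission": [
        "European Commission",
        "Commission européenne",
        "Europäische Kommission",
    ],
    "Joint Research Centre": ["Joint Research Centre", "JRC"],
}


def build_matcher(entities):
    """Compile one regex over all variants and a variant -> canonical lookup."""
    lookup = {}
    for canonical, variants in entities.items():
        for variant in variants:
            lookup[variant.lower()] = canonical
    # Longest variants first so that a longer name wins over a shorter one.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in alternatives) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern, lookup


def extract_entities(text, pattern, lookup):
    """Return (canonical name, character offset) for every mention found in text."""
    return [(lookup[m.group(0).lower()], m.start()) for m in pattern.finditer(text)]


if __name__ == "__main__":
    pattern, lookup = build_matcher(ENTITIES)
    sample = "La Commission européenne et le JRC publient un rapport."
    for name, offset in extract_entities(sample, pattern, lookup):
        print(offset, name)
```

Mapping every variant to a canonical name is what lets mentions in different languages be counted as the same person or organization.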

Web Interface
HTML/KML/RSS
Index Search
SMS/Email Notifications
Situation Reports/Bulletins

EMM OSINT Suite
Desktop software: a frontline toolkit for analysts in law enforcement, extending for TTO use
Exploits EMM technology
Tools for the typical OSInt process:
Google/Bing searches, automated result caching
Website crawling
Document import and analysis (PDF, Word)
Database import
Tools for drilling down into the extracted data
Easy to download and install; the user wiki is at wiki.emm4u.eu
Email: gerhard.wagner@jrc.ec.europa.eu
Process steps: Acquire Documents, Extract Information, Analyse, Organise

Ontopopulis: Automated Category Learning
A weakly supervised, multilingual system for statistical, knowledge-poor learning of semantic classes and co-occurring terms.
Inputs: a set of words categorized into different semantic classes, and an unannotated text corpus.
The system learns typical contexts for each of the input classes, then learns additional terms for these classes as well as co-occurring terms.
Ontopopulis uses vector space models to represent each input term and category as a vector of its typical contexts.

Ontopopulis Architecture
Inputs: text corpus, stop words
Seed: train, bus, truck, car
Extraction of contextual features
Contextual features:
driver of the X : 2.6
X plowed : 2.2
X was parked : 2.2
stopped a X : 2.2
collided with another X : 2.1
New term extraction
New terms: vehicle, van, lorry, taxi, minibus
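The pipeline above (seed terms, contextual features, new term extraction) can be illustrated with a small sketch. It is not the Ontopopulis implementation: raw counts and cosine similarity stand in for the system's unspecified feature weighting, and the function names, window size and toy corpus are assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def context_features(sentences, window=3):
    """Map each token to a Counter of left/right context patterns, with the token as X."""
    features = defaultdict(Counter)
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        for i, tok in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            if left:
                features[tok][" ".join(left) + " X"] += 1
            if right:
                features[tok]["X " + " ".join(right)] += 1
    return features


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def expand_class(sentences, seeds, stop_words=frozenset(), top_k=5):
    """Rank non-seed terms by how closely their contexts resemble the seed class."""
    features = context_features(sentences)
    class_vector = Counter()
    for seed in seeds:
        class_vector.update(features.get(seed, {}))
    scores = {
        term: cosine(vector, class_vector)
        for term, vector in features.items()
        if term not in seeds and term not in stop_words
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(term, score) for term, score in ranked[:top_k] if score > 0]


# Toy corpus, invented for this example.
CORPUS = [
    "the driver of the bus was injured",
    "the driver of the truck was injured",
    "the driver of the van was injured",
    "police stopped a car near the station",
    "police stopped a taxi near the station",
    "the minister of the region was present",
]

if __name__ == "__main__":
    for term, score in expand_class(CORPUS, seeds={"bus", "truck", "car"}):
        print(f"{term}\t{score:.3f}")
```

On this toy corpus only "van" and "taxi" share contexts with the seeds, so they are the only candidates returned; a real corpus would need stop-word filtering and a proper weighting scheme to keep frequent function words out of the ranking.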

Experiment with Collaboration Spotting Project
[Diagram labels: Pixel Detectors (Medipix, Timepix, Pilatus); OSInt (Google, Patent Archive); four corpora of 200 docs each; Ontopopulis; Categories]

Some newly learned terms for pixel

Term                  Score
photon counting       25.690942274867396
pixel detector        20.253963170097503
hybrid pixel          16.44489017531264
detectors atlas       15.669138788947823
counting              12.693406624130416
medipix3              8.992173355455991
cms                   8.60667074164573
neutron               8.546071244735776
cmos                  7.491286816662063
pixel                 6.660923550786378
cdte                  6.649387231206846
asic                  6.39956925683869
photon                6.118614842547322
hybrid semiconductor  6.063997576239122
silicon pixel         5.803307257335719
dectris               5.733415155500664
readout chip          5.626406711785769
readout               5.578930278060624
ray                   5.497647183200986
hybrid silicon        5.425429835817261
silicon               4.826877570299286
ccd                   4.316783340210495
cmos pixel            4.284586748172906
prototype             3.6989841156280217
gamma ray             3.6888069666365833
pilatus detector      3.6853366151703963
gaas                  3.6110798743358266
scintillation         3.6046300556624353
position sensitive    3.4549190788093114
modern                3.3905957242480973
photo                 3.3367462818562386
pixilated             3.2710198293767707

This experiment was carried out with the Collaboration Spotting project at CERN. The class expansion algorithm learned new candidate terms. We evaluated the 60 top-scored terms and found that about 90% of them are relevant: they represent other types of pixel detectors or are at least strongly related to the domain.

Thank You
emm.newsbrief.eu
wiki.emm4u.eu
gerhard.wagner@jrc.ec.europa.eu