Joint Research Centre: Open Source Monitoring Tools and Applications (emm.newsbrief.eu). Serving society, stimulating innovation, supporting legislation.
Open Source Monitoring - Overview: EMM Introduction; Custom Domains; Processing Features; End User Applications; Collaboration Spotting.
EMM Architecture
Sources (input): 4000 news sites, 175000 articles per day, 70 languages; also blogs, WWW and social media*. Categories: 1000 classes, 30000 keywords. Runs 24/7. Visitors: 25000. Developed, built and maintained by JRC.
EMM Categories: a powerful classification engine based on user-defined keywords and patterns. It allows boolean combinations, proximity and wildcards; supports Arabic and similar languages (automatic pronoun prefixing); and supports Chinese and similar languages (no whitespace). Categories can overlap, follow no ontology, and are multilingual. Categories are defined for countries, themes, EC institutions and agencies, policy areas, Commissioners, diseases and many more.
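To make the idea of keyword categories concrete, here is a minimal Python sketch of a category that combines wildcard keywords, a boolean exclusion, and a proximity rule. The function names, the rule layout (`any_of`, `none_of`, `proximity`), and the example "flood" category are assumptions made for illustration; they are not EMM's actual category syntax or implementation.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    # Lowercased word tokens. This only works for whitespace-delimited
    # languages; Chinese-like scripts would need a different tokenizer.
    return re.findall(r"\w+", text.lower())


def positions(tokens, pattern):
    # Indices of tokens matching a shell-style wildcard pattern, e.g. "flood*".
    return [i for i, tok in enumerate(tokens) if fnmatchcase(tok, pattern)]


def near(tokens, a, b, distance):
    # True if some match of `a` lies within `distance` tokens of a match of `b`.
    pos_a, pos_b = positions(tokens, a), positions(tokens, b)
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


def matches(text, any_of=(), none_of=(), proximity=()):
    # Hypothetical rule layout: exclusions veto the match; otherwise a single
    # keyword hit or a single satisfied proximity rule is enough.
    tokens = tokenize(text)
    if any(positions(tokens, p) for p in none_of):
        return False
    hit_keyword = any(positions(tokens, p) for p in any_of)
    hit_proximity = any(near(tokens, a, b, d) for a, b, d in proximity)
    return hit_keyword or hit_proximity


# Illustrative category definition (made up for this sketch).
flood = dict(
    any_of=("flood*",),
    none_of=("football",),
    proximity=(("river", "burst", 3),),
)

print(matches("Flooding hit the region overnight", **flood))
print(matches("The river finally burst its banks", **flood))
print(matches("A flood of goals in the football match", **flood))
```

Because each category is just an independent set of rules, one article can match several categories at once, which is consistent with the overlapping, ontology-free design described on the slide.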
Example of 1 category
Automated Entity Extraction: 500,000 persons and organizations, based on a continuously updated list of entities with many language-specific synonyms. Geo Tagger: multilingual gazetteer of over 1.5 million entries (growing). Further features: Powerful Categorisation Engine (a.k.a. Alerts); Tonality/Sentiment Detection; Duplicate Detection; Quote Extraction; Event Extraction; Automatic Language Detection; Meta-Data Filtering; Clustering and Story Tracking; Alerting System (SMS, email); Index of all text and metadata (Search); Statistical Analyser; RSS/KML Services for all extracted information; In-line Statistical Machine Translation; Multi-document Summarization. Supported languages: ar, bg, da, de, en, es, et, fr, it, nl, no, pl, pt, ro, ru, sl, sv, sw, tr.
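The entity extraction above is described as working from a list of known entities with language-specific synonyms. A minimal Python sketch of that kind of list lookup follows. The entity list, the function names, and the matching rules (case-insensitive, longest variant first) are assumptions for illustration only, not a description of how EMM does it.

```python
import re


def build_index(entities):
    # Map every known variant (lowercased) to its canonical entity name.
    index = {}
    for canonical, variants in entities.items():
        for name in (canonical, *variants):
            index[name.lower()] = canonical
    return index


def extract_entities(text, index):
    # Try longer variants first so a full name wins over a shorter alias.
    names = sorted(index, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    found = []
    for match in pattern.finditer(text):
        canonical = index[match.group(0).lower()]
        if canonical not in found:  # report each entity once, in text order
            found.append(canonical)
    return found


# Illustrative entity list with language-specific synonyms (made up here).
index = build_index({
    "European Commission": ["Commission européenne", "Europäische Kommission"],
    "World Health Organization": ["WHO", "OMS"],
})

print(extract_entities("Die Europäische Kommission und die WHO tagten.", index))
```

Case-insensitive matching keeps the sketch short but would also match the English word "who"; a real system would need stricter rules for short acronyms.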
Web Interface HTML/KML/RSS Index Search SMS/Email Notifications Situation Reports/Bulletins
EMM OSINT Suite: desktop software, a frontline toolkit for analysts in law enforcement, being extended for TTO use. Exploits EMM technology. Tools for the typical OSInt process (acquire documents, extract information, analyse, organise): Google/Bing searches with automated result caching; website crawling; document import and analysis (PDF, Word); database import; tools for drilling down into the extracted data. Easy to download and install. Wiki: wiki.emm4u.eu. Email: gerhard.wagner@jrc.ec.europa.eu
Ontopopulis: Automated Category Learning. A weakly supervised, multilingual system for statistical, knowledge-poor learning of semantic classes and co-occurring terms. Inputs: a set of words categorized into different semantic classes, and an unannotated text corpus. The system learns typical contexts for each input class, then learns additional terms for these classes as well as co-occurring terms. Ontopopulis uses vector space models to represent each input term and category as a vector of its typical contexts.
Ontopopulis Architecture. Inputs: text corpus and seed terms (train, bus, truck, car). Extraction of contextual features, for example: "driver of the X" (2.6), "X plowed" (2.2), "X was parked" (2.2), "stopped a X" (2.2), "collided with another X" (2.1). Stop words. New term extraction yields new terms: vehicle, van, lorry, taxi, minibus.
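The pipeline above (seed terms, contextual features, new term extraction) can be sketched in a few lines of Python. This is a deliberately simplified illustration: it weights each context by how often it occurs with a seed term and scores candidates by the summed weights of the contexts they share. The function names, the context window, the raw-count weighting, and the toy corpus are all assumptions; the slides do not specify the weighting scheme Ontopopulis actually uses.

```python
from collections import Counter, defaultdict


def contexts(tokens, i, width=3):
    # Left and right context patterns around position i, with the term
    # itself replaced by the placeholder X (e.g. "driver of the X").
    left = " ".join(tokens[max(0, i - width):i])
    right = " ".join(tokens[i + 1:i + 1 + width])
    features = []
    if left:
        features.append(left + " X")
    if right:
        features.append("X " + right)
    return features


def expand(corpus, seeds, stop_words=frozenset(), top_n=5):
    seeds = set(seeds)
    feature_weight = Counter()            # context -> count with seed terms
    term_features = defaultdict(Counter)  # term -> its context vector
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            for feature in contexts(tokens, i):
                term_features[tok][feature] += 1
                if tok in seeds:
                    feature_weight[feature] += 1
    scores = {}
    for term, features in term_features.items():
        if term in seeds or term in stop_words:
            continue
        # Dot product of the term's context vector with the class vector.
        score = sum(feature_weight[f] * n for f, n in features.items())
        if score > 0:
            scores[term] = score
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy corpus, made up for this sketch.
corpus = [
    "the driver of the bus was injured",
    "the driver of the truck was injured",
    "the driver of the van was injured",
    "police stopped a car yesterday",
    "police stopped a taxi yesterday",
    "the mayor visited the market today",
]
new_terms = expand(corpus, seeds=["bus", "truck", "car"],
                   stop_words={"the", "of", "a", "was"})
print(new_terms)
```

On this toy corpus the sketch proposes "van" and "taxi" because they occur in the same contexts as the seed vehicles, while "market" shares no context with them and is not proposed.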
Experiment with Collaboration Spotting (diagram): Project Pixel Detectors (Medipix, Timepix, Pilatus); OSInt; Google Patent Archive; four corpora of 200 docs each; Ontopopulis; Categories.
Some newly learned terms for pixel
This experiment was carried out with the Collaboration Spotting project at CERN. The class expansion algorithm learned new candidate terms. We evaluated the 60 top-scored terms and found that about 90% of them are relevant: they represent other types of pixel detectors or are at least strongly related to the domain.
Term                    Score
photon counting         25.690942274867396
pixel detector          20.253963170097503
hybrid pixel            16.44489017531264
detectors atlas         15.669138788947823
counting                12.693406624130416
medipix3                8.992173355455991
cms                     8.60667074164573
neutron                 8.546071244735776
cmos                    7.491286816662063
pixel                   6.660923550786378
cdte                    6.649387231206846
asic                    6.39956925683869
photon                  6.118614842547322
hybrid semiconductor    6.063997576239122
silicon pixel           5.803307257335719
dectris                 5.733415155500664
readout chip            5.626406711785769
readout                 5.578930278060624
ray                     5.497647183200986
hybrid silicon          5.425429835817261
silicon                 4.826877570299286
ccd                     4.316783340210495
cmos pixel              4.284586748172906
prototype               3.6989841156280217
gamma ray               3.6888069666365833
pilatus detector        3.6853366151703963
gaas                    3.6110798743358266
scintillation           3.6046300556624353
position sensitive      3.4549190788093114
modern                  3.3905957242480973
photo                   3.3367462818562386
pixilated               3.2710198293767707
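The evaluation described above (judging the top-scored terms and reporting the share that is relevant) amounts to precision at k. A minimal Python sketch follows. The function name is an assumption, and the relevance judgments in the example are invented purely to show the arithmetic; they are not the judgments made in the actual experiment.

```python
def precision_at_k(ranked_terms, relevant, k):
    # Share of the k top-ranked terms that were judged relevant.
    top = ranked_terms[:k]
    if not top:
        return 0.0
    return sum(1 for term in top if term in relevant) / len(top)


# Five terms taken from the ranked list, with made-up relevance judgments.
ranked = ["photon counting", "pixel detector", "hybrid pixel", "modern", "cdte"]
judged_relevant = {"photon counting", "pixel detector", "hybrid pixel", "cdte"}

print(precision_at_k(ranked, judged_relevant, 5))
```

With the reported figures, a precision of about 0.9 at k = 60 corresponds to roughly 54 relevant terms among the 60 evaluated.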
Thank You. emm.newsbrief.eu, wiki.emm4u.eu, gerhard.wagner@jrc.ec.europa.eu