15/02/2013 VRT RESEARCH & INNOVATION
Project C: Future CMS
Investigating options for content aggregation and categorisation
C.3.1.2 Proof of concept: content aggregation and categorisation
Content Aggregation & Categorisation
Scope: Within a Future Content Management System, be able to categorise and aggregate (external) content, with a view to optimising the authoring process on the one hand and recommending content on the other.
Strategy: Process, recognise and categorise (textual) content with metadata tags, based on thesauri (controlled vocabulary) and the Semantic Web (Linked Open Data, LOD). Group and aggregate matching content based on these metadata tags.
Architecture: Enrichment Service. Diagram components: Items, Enricher, Items & tags, Thesaurus, Linked Open Data.
Architecture: Aggregator Service. Diagram components: Items & tags, Lily, Aggregator, Contextualized grouped items, User behavior, User context.
Content Enrichment Frameworks: State of the Art
- OntoText KIM
- GATE (General Architecture for Text Engineering)
- Apache Stanbol
Content Enrichment Frameworks: Apache Stanbol Selection
PRO
- Open source framework (based on the Apache License)
- Flexible set of reusable components for semantic content management
- Support for content enhancement based on semantic engines
- Support for custom and domain vocabularies (like VRT's Thesaurus)
CON
- Incubator framework, work in progress
FutureCMS: Enricher Components
- Setup of RDF triple store with DBPedia, GeoNames & VRT Thesaurus
- Setup of Apache Stanbol
- Setup of Enricher Chain
- Implementation of Polopoly Enrich Service (based on JSON)
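A JSON-based enrich service of this kind would typically sit in front of Stanbol's RESTful enhancer endpoint. The sketch below only illustrates how a client could post plain text to an enhancement chain and ask for a JSON response. The host, port and chain name are placeholders and are not taken from the slides, and the code is not the Polopoly Enrich Service itself.

```python
import urllib.request

# Assumed values: host, port and chain name are placeholders, not from the slides.
STANBOL_BASE = "http://localhost:8080"
CHAIN_NAME = "default"


def build_enhance_request(text, base=STANBOL_BASE, chain=CHAIN_NAME):
    """Build (but do not send) a POST request to a Stanbol enhancement chain."""
    url = f"{base}/enhancer/chain/{chain}"
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            # Ask Stanbol to serialise the enhancement results as JSON.
            "Accept": "application/json",
        },
        method="POST",
    )


def enhance(text):
    """Send the text to Stanbol and return the raw JSON response body."""
    with urllib.request.urlopen(build_enhance_request(text)) as response:
        return response.read().decode("utf-8")
```

Building the request separately from sending it keeps the URL and header logic easy to check without a running Stanbol instance.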
Enricher: Setup of Stanbol
Setup is done through Chef Solo provisioning (http://wiki.opscode.com/display/chef/chef+solo). The Chef run can be started on a clean Ubuntu Precise box; it downloads and builds Stanbol and installs it as a service. It also installs the Enrichment service, which makes use of Stanbol.
A Vagrant (http://vagrantup.com/) configuration is provided in enrichment/Provisioning/vagrants/enricher_dev. Running vagrant up in this folder creates a VirtualBox virtual machine with Stanbol and the service on it.
We also provide a mccloud (https://github.com/jedi4ever/mccloud) configuration to easily run the Chef recipes on the VRT instance.
Enricher: Custom Vocabulary (http://stanbol.apache.org/docs/trunk/customvocabulary.html)
The Chef run assembles the indexing tool and places it in /opt/stanbol/indexing-working-dir. In that folder, run
    sudo java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar init
which sets up the folder structure. Then place your .nt files with the SKOS thesaurus in /opt/stanbol/indexing-working-dir/indexing/resources/rdfdata and run
    sudo java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar index
After a while, two files are created in /opt/stanbol/indexing-working-dir/indexing/dist:
- vrtthesaurus.solrindex.zip, a Solr index with the thesaurus labels
- org.apache.stanbol.data.site.vrtthesaurus-1.0.0.jar, a bundle to install in the Felix console of Stanbol
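To illustrate what the .nt input for the indexing tool looks like, the following sketch writes SKOS concepts with a preferred label as N-Triples. The concept URI, the file name and the "nl" language tag are hypothetical; the real VRT Thesaurus export may use other properties (for example alternative labels) and a different URI scheme.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def concept_to_ntriples(uri, label, lang="nl"):
    """Return the two N-Triples lines describing one SKOS concept."""
    return [
        f"<{uri}> <{RDF_TYPE}> <{SKOS}Concept> .",
        f'<{uri}> <{SKOS}prefLabel> "{escape_literal(label)}"@{lang} .',
    ]


def write_thesaurus(path, concepts):
    """Write (uri, label) pairs to an .nt file, one triple per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for uri, label in concepts:
            for line in concept_to_ntriples(uri, label):
                handle.write(line + "\n")


if __name__ == "__main__":
    # Hypothetical concept URI, used only for illustration.
    write_thesaurus(
        "vrtthesaurus-sample.nt",
        [("http://example.org/thesaurus/gaza", "GAZA")],
    )
```

A file produced this way would go in the rdfdata folder before running the index step.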
Enricher: Stanbol Chain Configuration
Keyword Linking Chain
This chain compares the labels of the controlled vocabulary (VRT's Thesaurus) with the words in the text. It is language independent. The configuration is as follows:
Enricher: Stanbol Chain Configuration
For the Gaza article (http://www.deredactie.be/cm/vrtnieuws/buitenland/1.1484869) the following entities are recognized:
Places: DIE, EEN, EGYPTE, EGYPTISCHE POLITIEKE GROEPERINGEN, GAZA, GEMEENSCHAP, ISLAMITISCHE JIHAD, MAROKKO, MOSLIMBROEDERS, WAREN
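The result list contains common Dutch words such as DIE, EEN and WAREN, presumably because they also occur as thesaurus labels. The sketch below is a simplified, hypothetical stand-in for label-based keyword linking (it is not Stanbol's Keyword Linking Engine). It shows why a purely label-driven, language-independent match picks up such words: any word sequence that equals a label is tagged. The sample sentence and the short label list are made up for illustration.

```python
import re


def tokenize(text):
    """Split text into upper-cased word tokens."""
    return re.findall(r"\w+", text.upper())


def link_keywords(text, labels, max_words=3):
    """Return thesaurus labels that occur as a word sequence in the text."""
    label_set = {label.upper() for label in labels}
    tokens = tokenize(text)
    found = set()
    # Try every window of 1..max_words consecutive tokens against the labels.
    for size in range(1, max_words + 1):
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size])
            if candidate in label_set:
                found.add(candidate)
    return sorted(found)


if __name__ == "__main__":
    labels = ["GAZA", "EGYPTE", "MOSLIMBROEDERS", "WAREN", "ISLAMITISCHE JIHAD"]
    sentence = "De Moslimbroeders waren in Gaza en Egypte."
    print(link_keywords(sentence, labels))
    # prints ['EGYPTE', 'GAZA', 'MOSLIMBROEDERS', 'WAREN']
```

Here WAREN is matched even though it is used as a verb in the sentence, which is the kind of false positive that a language-aware chain can reduce.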
Enricher: NER Tagging Chain
This chain first performs Named Entity Recognition on the text and is therefore language dependent. The language of the text is first detected using the langid engine. Afterwards, the detected entities are linked with the controlled vocabulary using the NER Tagging Engine.
Enricher: NER Tagging Chain
For the Gaza article (http://www.deredactie.be/cm/vrtnieuws/buitenland/1.1484869) the following entities are recognized:
People: MOEBARAK HOSNI, MURSI, VRANCKX RUDI
Organisations: VRT
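The person labels above appear in "SURNAME FIRSTNAME" form, while names in running text usually appear the other way round. As a minimal sketch of the linking step, assuming the named entities have already been detected, the code below matches mentions to thesaurus labels with an order-insensitive key. This is a hypothetical simplification and not how Stanbol's NER Tagging Engine works internally; the mentions in the example are illustrative.

```python
import re


def name_key(name):
    """Order-insensitive key: upper-cased word tokens, sorted."""
    return tuple(sorted(re.findall(r"\w+", name.upper())))


def link_entities(mentions, labels):
    """Map each detected mention to a matching thesaurus label, or None."""
    index = {name_key(label): label for label in labels}
    return {mention: index.get(name_key(mention)) for mention in mentions}


if __name__ == "__main__":
    labels = ["VRANCKX RUDI", "MOEBARAK HOSNI", "VRT"]
    mentions = ["Rudi Vranckx", "Hosni Moebarak", "VRT"]
    for mention, label in link_entities(mentions, labels).items():
        print(mention, "->", label)
```

Mentions without a matching label map to None, so only entities known to the controlled vocabulary end up as tags.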
C.3.1.2 Enricher: Prototype