Michel de Ru michel.de.ru@dayon.nl
Newz.nl When competitors band together Introduction Michel de Ru Solution architect @ Dayon 16 years experience in publishing Among others Wolters-Kluwer, Sdu (ELS) and Dutch Railways Specialized in Content related Big Data challenges Specialized in added value through Semantic Technology Dayon, part of the HintTech Group We design, build and maintain content driven online and mobile applications We help customers develop their Content Strategy We realize it using Content Technology Partners include MarkLogic, Ontotext, Alfresco, Hippo CMS, Solr and OpenText Big Data projects for Dutch Public Library, Kluwer, Newz
Newz.nl When competitors band together Contents Challenges for news publishers Unified search and delivery platform How we built it New use-cases Conclusion Questions
Challenges for news publishers
Mln Euro Turn-key platform Newz Main challenges Traditional market is changing rapidly Publishers are under pressure 2,0 Gross income 1,6 1,2 0,8 0,4 0,0 2002 2003 2004 2005 2006 2007 2008 2008 2010 2011 2012 Public perception News for free Competitors steal content/piracy New initiatives have no content available
Main challenges Possible answers Ignore and let it happen Prohibit / forbid the use of our content Outsmart the competition We as combined publishers have the best quality and can offer the best knowledge about our combined news content
Main targets Efficiency: standardized B2B service delivery One delivery platform for all daily news publishers Quality: automated metadata enrichment Most newspapers don t have metadata available Linked open data and semantic technologies Centralized semantic curation Enable new business opportunities Centralized sourcing/licensing of content Incubator for new initiatives
Newz in the news
NDP Nieuwsmedia in the news Newz video in Dutch See www.newz.nl
The project 2012 January : RFI/RFP June : Dayon/Ontotext selected July : Project kick-off December: Newz BV founded 2013 March : System delivered July : Production
How it works
Data Journalistiek Applicatie
How we put it together
Newz.nl Big Data approach Volume Velocity Terra/Petabytes Catalogs Files Content Real time Near time Batch Streaming Volume Velocity Variety Big Data Variety Unstructured Structured Text, image, video Mixed rich content Value Value creation Added value Solutions
Dutch news = Big Data Volume Velocity Volume 15.000 news articles a day Variety Value Velocity Delivery spike during 2 hours a day (just before the morning starts) Usage is continuously (through API, Search and Subscription interfaces) Variety News articles without metadata and no structure whatsoever Linked Open Data Value Facilitate new News business solutions for integrators, app suppliers, etc. Deliver a standardized (NITF NewsML) and enriched format
Key aspects Big Data Content Store Enterprise NoSQL Volume Velocity Structured/unstructured ACID compliant (Atomicity, Consistency, Isolation, Durability) Semantic Technologies Concept extraction Variety Linked (Open) Data Graph databases / Inferencing Content Lifecycle Management Part of Application Lifecycle Management
Volume, Velocity Interface with News publishers Content Processing Framework Added a Java layer for full ETL and trailing capabilities Storage of News articles In cooperation with IPTC a Dutch version of NewsML-G2 has been defined Interface with Semantic Extraction framework Full search capabilities Enterprise grade We also calculated a MongoDB/Lucene solution ML won on: TCO, Success rate of business implementations, Enterprise resilience
Variety Semantic Extraction Existing news vocabularies and taxonomies + Linked Open Data World class Semantic Extraction (NLP, Golden Standard, Rules, etc.) Conversion to an ontology (similar to semantic web) Triples stored in OWLIM Enterprise Enrichment of news articles Organizations Persons Locations Events Keywords Mentions From a lot of data To even more data!
e.g. Barack Obama e.g. Democratic Party e.g. Netherlands
Architecture overview
Use cases
Voorbeeld: Automatische geo taxonomie Nieuwsartikel gaat over Haditha in Irak Wat als je meer wilt weten over de regio? 1. Artikel is semantisch verrijkt met de plaatsnaam 2. Op basis van Linked Open Data wordt een taxonomie getoond 3. Daarmee kan alle content die over de regio gaat gevonden worden
Voorbeeld: tijd reizen door infographics
Voorbeeld: Research Research over bepaalde onderwerpen Geef de meest relevante artikelen Geef relevatie in de tijd gezien Geef de mogelijkheid tot een verdiepende zoektocht
Voorbeeld: Mashups Research over bepaalde onderwerpen Verrijk resultaat met Verrijk Linked Open resultaat met Data Linked Open Data Verrijk resultaat met eigen taxonomie / ontologie
Concluding
Newz.nl When competitors band together In conclusion Lessons learned Hard to keep 12 frogs in the wheelbarrow, but we succeeded! Semantic Extraction is a gigantic specialism, start early in the project! MarkLogic delivers to it s promises, don t think about it just use it! Business opportunities need both commercial and technical perspectives! Next steps Commercialize the Newz organization Connect more publications Enable market initiatives by using the Newz API (commercial and non-commercial) Enable re-use within the publishers
Any Questions?