Adobe Semantic Analysis Platform




Sept. 3, 2008
Walter W. Chang, Senior Computer Scientist
Advanced Technology Labs, Adobe Systems, Inc.

Presentation Overview
- Background and motivation
- Challenges
- Semantic analysis techniques
- Platform architecture
- Screenshot demo
- Summary / lessons learned
- Future direction and trends


Project history
- Semantic analysis platform for documents; project started in 2005 in Adobe Advanced Technology Labs
- Targeted at enterprise and government intelligence document workflows
- Identified growing opportunity in contextualized advertising
- Launched public Ads for Adobe PDF system in Nov. 2007
- Document analysis and topic/keyword recommendations for Yahoo! Ads
- 60 registered publishers (e.g., IEEE, CMP), 40 pending, others in discussion (Scientific American)

Challenges in developing our platform
- Finding the correct set of analysis methods to understand documents
- Layers of representation and structure in documents
- Varying degrees of semantic noise
- Popular analysis methods are TF-IDF-based
- Prioritizing results from all analysis methods
- Handling multi-theme documents
- Fluid, dynamic nature of ontologies

Key problem statement
For document X, determine Aboutness(X):
- What are the main topics and concepts in X?
- Contextual model for X?
- Intentional attributes of X?
To compute Aboutness(X), a content intelligence system needs:
- Text extraction
- Metadata identification and extraction
- Extraction and statistical analysis of content N-grams
- Shallow and deep semantic analysis methods
- Mechanisms for generation of contextual ad metadata

Semantic model to address key problem
For document X, determine Aboutness(X):
- Main topics and concepts in X
- Contextual model for X
- Intentional attributes of X
Develop a canonical semantic model for X:
- Topic domain contextualization (CV, ontology)
- Surface semantics
- Concept subontology
- Intentional semantics
Analysis methods:
- Text extraction
- Statistical BOW models
- N-gram TF-IDF & distribution
- Taxonomy/ontology-based classifiers
- Theme-based gist / summarization
- NLP + deep semantic analysis
- Sentiment analysis
- Inference and rule engines


Overview of Semantic Analysis Techniques
- Text extraction, lexical encoding and normalization
- Extensions to TF-IDF
- Keyword and N-gram models
  - Document level
  - Page level
- Employing ontologies
- Concept/topic analysis
- Summary/gist creation
- Domain expertise via rule engine
- Analysis result weighting

Text Extraction Challenges
- Missing document info (OCR, PDF)
- Complex reading order within document layout
- Presence of document noise (headers/footers)

Text Extraction Approach
- Use positional layout of text for inferring structure
- Vertical and horizontal ray projection of text density
- Sampling to infer word, sentence and column gutter spacing
- Use statistical methods and heuristics to find text zones

Text Extraction Approach (continued)
- Recursively subdivide page into text zones
- Use heuristics to iteratively scan text in each zone
- Re-synthesize likely reading-order text per zone
- Identify and remove semantic noise (noise artifacts)
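The projection idea can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: it assumes text boxes are given as integer (x0, y0, x1, y1) tuples, and the function names and the `min_gutter` threshold are hypothetical. It performs a single horizontal split into columns; a recursive version would apply the same cut alternately on each axis within every zone.

```python
def projection_profile(boxes, length, axis):
    """Count how many text boxes cover each unit position along one axis.
    axis=0 projects onto x (column gutters); axis=1 projects onto y."""
    profile = [0] * length
    for box in boxes:
        start, end = (box[0], box[2]) if axis == 0 else (box[1], box[3])
        for pos in range(max(0, start), min(length, end)):
            profile[pos] += 1
    return profile


def find_gaps(profile, min_width):
    """Return (start, end) runs of empty positions at least min_width wide.
    Runs touching either edge are treated as margins and skipped."""
    gaps, run_start = [], None
    for pos, density in enumerate(profile + [1]):  # sentinel closes the last run
        if density == 0 and run_start is None:
            run_start = pos
        elif density != 0 and run_start is not None:
            interior = run_start > 0 and pos < len(profile)
            if interior and pos - run_start >= min_width:
                gaps.append((run_start, pos))
            run_start = None
    return gaps


def split_columns(boxes, page_width, min_gutter=10):
    """Split boxes into left-to-right zones at the middle of each gutter."""
    gaps = find_gaps(projection_profile(boxes, page_width, axis=0), min_gutter)
    cuts = [(start + end) // 2 for start, end in gaps]
    zones = [[] for _ in range(len(cuts) + 1)]
    for box in boxes:
        zones[sum(1 for cut in cuts if box[0] >= cut)].append(box)
    return zones
```

Reading each zone top to bottom, then moving to the next zone, gives one plausible reading order for a simple multi-column page.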

Semantic Analysis Techniques
Determine surface Aboutness(X) for the document:
- Normalize, find keywords and N-grams
- Perform statistical analysis:
  - TF-IDF analysis on keywords/N-grams
  - Term distribution analysis
- Rank terms by frequency, section weights, and term cluster position
- Categorize document by topics and concepts
- Summarize document
- Generate and submit terms to query the ad aggregator's inventory

Keyword and N-gram Analysis
- Input: text of source document
- Stemming (e.g., Porter, Krovetz): normalize pluralization, tense ("sky", "skies", etc.)
- Stopword filtering: remove trivial stopwords ("the", "a", etc.)
- Term N-gram extraction: find term n-grams ("British Columbia", "relational database", etc.)
- Term frequency analysis
- Term distribution analysis
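A rough sketch of the normalization and n-gram steps follows. The stopword list and the `naive_stem` rule are deliberately tiny stand-ins chosen for illustration (the slide names Porter and Krovetz stemmers, which do far more), and all function names here are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "on", "for"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Crude plural normalization, standing in for a real stemmer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def extract_ngrams(text, n=2):
    """Count n-grams of consecutive non-stopword tokens.
    A stopword ends the current run, so n-grams never span one."""
    counts = Counter()
    run = []
    for token in tokenize(text) + [None]:  # None flushes the final run
        if token is None or token in STOPWORDS:
            for i in range(len(run) - n + 1):
                counts[" ".join(run[i:i + n])] += 1
            run = []
        else:
            run.append(naive_stem(token))
    return counts
```

Calling `extract_ngrams(text, n=1)` gives plain stemmed keyword counts, so the same function covers both the keyword and the n-gram case.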

Basic Keyword and Key Term Analysis
- Term frequency analysis: count per term (Term(i), Term(j)); use TF-IDF for surface analysis of semantics, at document level and page level
- Term distribution analysis: per term, min(pos), max(pos), avg(pos), and a 2 S.D. spread around the average position
- Use N-gram distribution analysis to find the topic center of gravity
[Figure: term frequency histogram, and term position ranges for Term(i) and Term(j) with average position and 2 S.D. marks]
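The two measurements on this slide can be approximated as below. This is a sketch under assumptions: the smoothed IDF formula is one common variant rather than the one the platform used, positions are plain token indices, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def tf_idf(term, doc_tokens, corpus):
    """Term frequency in one document times a smoothed inverse document
    frequency over a corpus (a list of token lists)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus if term in tokens)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf


def position_stats(term, doc_tokens):
    """Summarize where a term occurs: min, max, average position, and the
    band of two standard deviations around the average."""
    positions = [i for i, token in enumerate(doc_tokens) if token == term]
    if not positions:
        return None
    avg, sd = mean(positions), pstdev(positions)
    return {
        "min": positions[0],
        "max": positions[-1],
        "avg": avg,
        "band": (avg - 2 * sd, avg + 2 * sd),
    }
```

A term whose band is narrow is concentrated in one part of the document, while a wide band suggests a theme that runs throughout; the average position is one simple reading of the "center of gravity" idea.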

Semantic Analysis Techniques
How well does statistical document aboutness work? Reasonable results in many cases, but...
Problems:
- Semantic model based on term strength and co-occurrence
- Sensitive to writing styles that skew N-gram distributions
- Poor selectivity for multi-topic documents
Need:
- Semantic model of content (e.g., weighted topic tree)
- Logic-based inferencing using key topics
- Mechanism to weigh statistical and symbolic semantics

Build Semantic Model of Concepts
Goal: Construct concept/topic graph for document
How:
- Use document categorization analysis methods to build topic hierarchy
- Leverage term statistics to identify strongest topics
- Leverage external taxonomy/thesaurus/ontologies
- Use topic supertypes for generalization
  E.g.: soccer -> field game -> outdoor game -> sport
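Generalization through supertypes can be illustrated with a simple parent map. The soccer chain comes from the slide; the "rugby" entry and the function names are hypothetical additions for the example.

```python
# Child topic -> supertype. The soccer chain is from the slide;
# "rugby" is an invented sibling used only to show a shared supertype.
SUPERTYPE = {
    "soccer": "field game",
    "rugby": "field game",
    "field game": "outdoor game",
    "outdoor game": "sport",
}


def generalizations(topic, supertype=SUPERTYPE):
    """Walk up the supertype chain, most specific ancestor first."""
    chain, seen = [], {topic}
    while topic in supertype:
        topic = supertype[topic]
        if topic in seen:  # guard against accidental cycles
            break
        seen.add(topic)
        chain.append(topic)
    return chain


def common_supertype(a, b, supertype=SUPERTYPE):
    """Return the most specific topic that covers both a and b, or None."""
    ancestors_of_a = [a] + generalizations(a, supertype)
    for candidate in [b] + generalizations(b, supertype):
        if candidate in ancestors_of_a:
            return candidate
    return None
```

Two detected topics that share a nearby supertype can then be reported under that supertype instead of as unrelated terms.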

How does an Ontology work?
- Use standardized term relationships
  - Class generalization / specialization
  - Instance generalization / specialization
  - Class relationships
- Ontology thesauri example
- Enables upper TM platform layers, e.g., semantic analysis
Relation key:
  NT  Narrower Term
  BT  Broader Term
  SN  Synonym
  RT  Related Term
  UF  Use For
  TT  Top Term
[Figure: thesaurus example linking Agriculture Products, Fruits, Vegetables, Produce, Herbaceous plants, Apples, Pears, and Carrots through NT, BT, RT, and UF relations; the legend distinguishes terms from non-preferred terms]
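A thesaurus with these relation types can be modeled as a small labeled graph. The sketch below is illustrative only: the class and method names are hypothetical, and the specific links in `build_example` are guesses built from the slide's term names, since the edges of the original diagram are not recoverable from the transcript.

```python
from collections import defaultdict

# BT and NT are inverses of each other; RT is symmetric.
INVERSE = {"BT": "NT", "NT": "BT", "RT": "RT"}


class Thesaurus:
    def __init__(self):
        self.relations = defaultdict(lambda: defaultdict(set))
        self.preferred = {}  # non-preferred term -> preferred term (UF)

    def relate(self, term, relation, other):
        """Record a relation and its inverse in one step."""
        self.relations[term][relation].add(other)
        self.relations[other][INVERSE[relation]].add(term)

    def use_for(self, preferred, non_preferred):
        """Declare that 'preferred' should be used for 'non_preferred'."""
        self.preferred[non_preferred] = preferred

    def resolve(self, term):
        """Map a non-preferred term to its preferred form."""
        return self.preferred.get(term, term)

    def related(self, term, relation):
        return sorted(self.relations[self.resolve(term)][relation])


def build_example():
    """Illustrative links only; not read from the slide's diagram."""
    thesaurus = Thesaurus()
    thesaurus.relate("Fruits", "BT", "Agriculture Products")
    thesaurus.relate("Vegetables", "BT", "Agriculture Products")
    thesaurus.relate("Apples", "BT", "Fruits")
    thesaurus.relate("Pears", "BT", "Fruits")
    thesaurus.relate("Carrots", "BT", "Vegetables")
    thesaurus.relate("Fruits", "RT", "Vegetables")
    thesaurus.use_for("Agriculture Products", "Produce")
    return thesaurus
```

Storing the inverse link at insertion time keeps broader and narrower lookups consistent without a separate pass over the data.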

Example document: Travel guide for Canada
- PDF: 1-1000 pages, average = 5 pages, multiple subtopics
- Well-written text, HS to college-level English
- Well-structured topically
- Domain terminology

Document Topic/Concept Extraction
- Inputs: section weights; term frequencies & distributions
- Text stream filter: tokenizers, stopword filters, term stemmers, sentence segmenter
- Topic / concept extractor: ontology manager, taxonomy / thesaurus inferencing
- Topic / concept weighting: scoring rules
Document concept taxonomy (score, topic):
   0.0  geography
   0.0    physical geography
   0.0      bodies of water
   2.0        oceans
   0.0      land forms
   5.0        mountains
   0.0    political geography
   0.0      North America
   5.0        United States
  20.0        Canada
   4.0          Alberta
  11.0          Newfoundland
   0.0  culture & society
   0.0    leisure & recreation
   4.0      vacations
   0.0    arts & entertainment
   0.0      broadcast media
   3.0        television
   0.0  technology & sciences
   0.0    social sciences
   4.0      history
   0.0  transportation
   0.0    travel industry
  14.0      tourism
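One way to use such a scored taxonomy is to roll leaf scores up to their ancestors so that broad themes can be compared. The slide does not state how the platform combined scores, so summing is an assumption here, and the nesting below is a reconstruction of part of the tree using the scores shown.

```python
# topic -> (own score, children). Scores are from the slide; the nesting
# is a reconstruction, and only two branches are included for brevity.
TAXONOMY = {
    "geography": (0.0, {
        "physical geography": (0.0, {
            "bodies of water": (0.0, {"oceans": (2.0, {})}),
            "land forms": (0.0, {"mountains": (5.0, {})}),
        }),
        "political geography": (0.0, {
            "North America": (0.0, {
                "United States": (5.0, {}),
                "Canada": (20.0, {
                    "Alberta": (4.0, {}),
                    "Newfoundland": (11.0, {}),
                }),
            }),
        }),
    }),
    "transportation": (0.0, {
        "travel industry": (0.0, {"tourism": (14.0, {})}),
    }),
}


def rolled_up(tree):
    """Return topic -> own score plus the scores of all descendants."""
    totals = {}

    def visit(name, node):
        score, children = node
        total = score + sum(visit(child, sub) for child, sub in children.items())
        totals[name] = total
        return total

    for name, node in tree.items():
        visit(name, node)
    return totals
```

With these numbers, "Canada" accumulates 35.0 from its own score and its two provinces, which matches the intuition that the document is primarily about Canada.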

Semantic Analysis Techniques
Observation: still other valuable concepts present
- Use document summarization analysis methods
- Goal: capture key statement semantics via sampling
- Leverage topics/concepts to identify best sentences to extract into summary
- Leverage external taxonomy / thesaurus / ontology
- Find terms that support more general topics/concepts
  - E.g.: mention of "sightseeing" supports the "tourism" theme
  - E.g.: mention of "British Columbia" supports the "Canada" theme

Document Summarization
- Inputs: section weights; term frequencies & distributions
- Text stream filter: tokenizers, stopword filters, term stemmers, sentence segmenter
- Topic / concept based sentence extractor: ontology manager, topic-based sentence selection
- Sentence weighting: weighting rules
- Pick up co-occurring terms
Example summary output:
This will take you to our Virtual Canada Book web site to view or download video clips. They are linked to our website. Inquiries about this ebook should be sent to info@bcpictures.com. Virtual Canada Contents Introduction to Canada: A country of many colors. British Columbia and Vancouver Island: for the most scenic of mountain panoramas. The Government Offices. This site provides information on federal programs and services, departments and agencies. VIA Rail Canada VIA operates trains in all regions of Canada over a network spanning the country from the Atlantic to the Pacific. Greyhound Canada Coach service to nearly 1,100 towns and cities in Canada, as well as the United States. Visit the Canadian Automobile Association CAA offices across the entire country.
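Topic-based sentence selection can be sketched as scoring each sentence by the weighted topic terms it mentions and keeping the best few in document order. The regular-expression sentence splitter, the substring matching, and the function names are simplifying assumptions, not the platform's weighting rules.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, topic_weights, max_sentences=2):
    """Keep the highest-scoring sentences, returned in original order.
    A sentence's score is the sum of weights of the topic terms it contains."""
    scored = []
    for index, sentence in enumerate(split_sentences(text)):
        lowered = sentence.lower()
        score = sum(weight for term, weight in topic_weights.items()
                    if term.lower() in lowered)
        scored.append((score, index, sentence))
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return [sentence for score, index, sentence
            in sorted(best, key=lambda item: item[1]) if score > 0]
```

Sentences that mention no weighted topic score zero and are dropped, which is one simple way to keep boilerplate such as contact instructions out of the summary.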

Weighing statistical & semantic approaches
- Input document: text of source document
- Shared preprocessing: text stream filter, tokenizers, stopword filters, term stemmers, sentence segmenter
- Analysis methods, each with its own weight (ω0, ω1, ω2, ..., ωi, ..., ωn): statistical keywords (TF-IDF), statistical N-grams, topic / concept extraction, summarization, ontology-based inferencing
- Output: document essence
- Evaluation signals: relevance, inventory match, monetization, human evaluation
- Goal: infer and approximate conceptual and intentional semantics of content
Issues:
- No explicit ground truth!
- Lots of parameters & weights
- Difficult to tune & stabilize
- Changes will break things
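The weighting step can be pictured as a weighted sum over per-method term scores. The sketch below assumes each method returns a term-to-score mapping and normalizes each method by its top score before applying its weight; both the normalization and the names are assumptions, since the slide only shows that each method carries a weight ωi.

```python
def combine(results, weights):
    """Merge per-method term scores into one ranking.
    results: method name -> {term: score}; weights: method name -> weight.
    Each method is scaled so its best term scores 1.0 before weighting."""
    combined = {}
    for method, scores in results.items():
        weight = weights.get(method, 0.0)
        top = max(scores.values(), default=0.0)
        if top <= 0:
            continue  # nothing usable from this method
        for term, score in scores.items():
            combined[term] = combined.get(term, 0.0) + weight * score / top
    # Highest combined score first; ties broken alphabetically.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))
```

Even in this toy form, every added method brings one more weight to tune, which illustrates why the slide calls the system difficult to stabilize without ground truth.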


Architecture for a semantic analysis platform
- Framework for modular semantic analysis workflows (similar platforms: e.g., IBM UIMA)
- Use Adobe proprietary and 3rd-party semantic services
- One interchange format for all semantic metadata
- Open language, server, database architecture
  - C/C++, Java, PHP, Python
  - Apache, Tomcat
  - Oracle, SQLite, and JDBC-accessible databases
- Services orchestrated by workflow (WF) engine

Adobe content intelligence platform
1. Input document (Content input)
   - Documents, upload interface, tools & utilities, CMS adapters, crawlers
2. Extract, structure, & create text (Text extraction)
   - Layout extraction, page/section segmentation, text extraction, text glyph filtering, stopword filtering, term stemming
3. Create semantic metadata & tags (Metadata generation)
   - Keyterm entity extractor, categorizer & theme analyzer, summarizers, other extractors & analyzers
4. Normalize & persist metadata (Metadata persistence)
   - XMP metadata services, metadata persistence services, metadata repository
5. Retrieve, filter, and analyze all metadata (Semantic analysis)
   - Category & summary filters, category taxonomy rule engine, Adobe keyterm ranker
6. Score metadata & create essence (Essence generation)
   - Weight categories & themes, recommend rule-based categories, recommend doc & page keyterms (XML output)
Taxonomies & ontologies
   - Commercial & open source taxonomies, domain taxonomies, generic taxonomies, taxonomy & ontology builder
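The staged design can be mimicked with plain functions that pass one shared record along, echoing the idea of a single interchange format. Everything in this sketch is a toy stand-in: the stage bodies, the in-memory `REPOSITORY`, and the record layout are assumptions made for illustration, and only the stage names follow the slide.

```python
from collections import Counter

REPOSITORY = {}  # in-memory stand-in for the metadata repository


def text_extraction(record):
    # Collapse whitespace as a placeholder for layout-aware extraction.
    record["text"] = " ".join(record["raw"].split())
    return record


def metadata_generation(record):
    words = [w.lower().strip(".,") for w in record["text"].split()]
    record["metadata"]["keyterms"] = Counter(w for w in words if len(w) > 3)
    return record


def metadata_persistence(record):
    REPOSITORY[record["doc_id"]] = dict(record["metadata"])
    return record


def semantic_analysis(record):
    stored = REPOSITORY[record["doc_id"]]
    record["metadata"]["ranked"] = [t for t, _ in stored["keyterms"].most_common()]
    return record


def essence_generation(record):
    record["essence"] = record["metadata"]["ranked"][:3]
    return record


STAGES = [text_extraction, metadata_generation, metadata_persistence,
          semantic_analysis, essence_generation]


def run_workflow(doc_id, raw, stages=STAGES):
    """Run every stage in order over one shared record."""
    record = {"doc_id": doc_id, "raw": raw, "metadata": {}}
    for stage in stages:
        record = stage(record)
    return record
```

Because each stage only reads and writes the shared record, a stage can be swapped for a third-party service without touching its neighbors.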

Semantic analysis processing node i
- Job queue: PDF file1, PDF file2, PDF file3, PDF file4, PDF file5, PDF file6, ...
- Doc.Reg. processes 01 to 04 and server (svr) processes 01 to 04
- svr process 01: semantic analysis WF 01 to WF 10
- svr process 02: semantic analysis WF 11 to WF 20
- svr process 03: semantic analysis WF 21 to WF 30
- svr process 04: semantic analysis WF 31 to WF 40
- Each semantic analysis workflow = 1 thread; 10 analysis threads per process
[Figure: each workflow runs the full platform pipeline, from content input through text extraction, metadata generation, metadata persistence, and semantic analysis to essence generation]
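The "one workflow per thread, ten threads per process" layout resembles a thread pool draining a job queue. The sketch below models a single server process only; `analyze` is a placeholder for a full semantic analysis workflow, and the names are hypothetical.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

THREADS_PER_PROCESS = 10  # figure taken from the slide


def analyze(pdf_name):
    """Placeholder for one complete semantic analysis workflow."""
    return pdf_name, f"metadata for {pdf_name}"


def drain(job_queue, workers=THREADS_PER_PROCESS):
    """Take every queued job and analyze them on a pool of worker threads."""
    jobs = []
    while True:
        try:
            jobs.append(job_queue.get_nowait())
        except queue.Empty:
            break
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(analyze, jobs))
```

Scaling to the four server processes on the slide would mean running several such pools in separate processes against a shared queue, which is left out here.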


Screenshot Demo
- Ads for Adobe PDF Powered by Yahoo!
- Hosted in Adobe co-location
- Launched public beta Q4 2007
- 60+ publishers participating
- System workflows: user registration, semantic analysis, PDF interaction

Marketing Website @ Adobe Labs http://labs.adobe.com/technologies/adsforpdf

Login to Adobe Portal

Proceed to Adobe Portal


Publish the PDF
- Adobe semantic metadata used to match against ad inventory
- On ad click, the ad network provider, Adobe, and the content publisher share ad revenue


Summary
- Launched new semantic service: Ads for Adobe PDF
- Features in 1.1:
  - Page-level analysis for page-specific ads
  - High-volume registration and analysis scalability: publishers with millions of PDFs
- Adobe content intelligence platform using:
  - Semantic model of content
  - Multi-level semantic analysis
- Allows publishers to easily monetize content
- Combines:
  - Statistical keyword analysis
  - Document topic analysis and summarization
  - Ontology and rules-based inferencing

Lessons Learned in 1.0
- Need to use a hybrid semantic analysis approach:
  - Statistical methods based on N-grams (TF-IDF)
  - Ontologies are key: machine learning and automatic construction
  - Symbolic theme/topic inference engine
  - Logic rule engines to deal with intentional semantics
- Document topic analysis problem: long documents, multiple topics
  - Aboutness(X) with generalization
  - Segmentation: need to refine approach to topic segmentation (e.g., Hearst)
- Plan for ground-truth evaluations
  - Large number of tuning points
  - Use systematic (WF-wide) analysis tracing & logging
- Understand ad network inventory from provider
  - Adapt to non-linear ad network behavior (revenue vs. relevance)

Future Direction and Trends
- Need for deeper semantic analysis of text
  - Large-scale computational linguistics
  - Use broader knowledge base, e.g., Wikipedia, Google, the Web
- Automatic targeted ontology learning
  - New vocabulary and topics
  - Topic interrelationships
- User preference model based on:
  - Fine-grained model of content corpus
  - Global user behavior
- Extensions to other media types: audio and video
  - Speech-to-text
  - Scene analysis, image/object identification