Adobe Semantic Analysis Platform
Sept. 3, 2008
Walter W. Chang, Senior Computer Scientist
Advanced Technology Labs, Adobe Systems, Inc.
Presentation Overview
    Background and motivation
    Challenges
    Semantic analysis techniques
    Platform architecture
    Screenshot demo
    Summary / lessons learned
    Future direction and trends
Project history
    Semantic analysis platform for documents: project started in 2005 in Adobe Advanced Technology Labs
    Targeted at enterprise and government intelligence document workflows
    Identified growing opportunity in contextualized advertising
    Launched public Ads for Adobe PDF system in Nov. 2007
        Document analysis and topic/keyword recommendations for Yahoo! Ads
        60 registered publishers (e.g., IEEE, CMP), 40 pending, others in discussion (e.g., Scientific American)
Challenges in developing our platform
    Finding the correct set of analysis methods to understand documents
    Layers of representation and structure in documents
    Varying degrees of semantic noise
    Popular analysis methods are TF-IDF-based
    Prioritizing results from all analysis methods
    Handling multi-theme documents
    Fluid, dynamic nature of ontologies
Key problem statement
    For document X, determine Aboutness(X)
        What are the main topics and concepts in X?
        Contextual model for X?
        Intentional attributes of X?
    To compute Aboutness(X), a content intelligence system needs:
        Text extraction
        Metadata identification and extraction
        Extraction and statistical analysis of content N-grams
        Shallow and deep semantic analysis methods
        Mechanisms for generation of contextual ad metadata
Semantic model to address key problem
    For document X, determine Aboutness(X)
        Main topics and concepts in X
        Contextual model for X
        Intentional attributes of X
    Develop a canonical semantic model for X:
        Topic domain contextualization (CV, ontology)
        Surface semantics
        Concept subontology
        Intentional semantics
    Methods:
        Text extraction
        Statistical BOW models
        N-gram TF-IDF & distribution
        Taxonomy/ontology-based classifiers
        Theme-based gist / summarization
        NLP + deep semantic analysis
        Sentiment analysis
        Inference and rule engines
Overview of Semantic Analysis Techniques
    Text extraction, lexical encoding and normalization
    Extensions to TF-IDF
    Keyword and N-gram models
        Document level
        Page level
    Employing ontologies
    Concept/topic analysis
    Summary/gist creation
    Domain expertise via rule engine
    Analysis result weighting
Text Extraction Challenges
    Missing document info (OCR, PDF)
    Complex reading order within document layout
    Presence of document noise (headers/footers)
Text Extraction Approach
    Use positional layout of text for inferring structure
    Vertical and horizontal ray projection of text density
    Sampling to infer word, sentence, and column gutter spacing
    Use statistical methods and heuristics to find text zones
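The projection idea on this slide can be illustrated with a projection profile: count how much text covers each position along an axis, then treat wide empty runs as gutters. This is a minimal sketch, not the platform's implementation; the `Box` layout, the function names, and the `min_width` threshold are all assumptions made for the example.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) bounding box of one word (assumed layout)

def projection_profile(boxes: List[Box], axis: int, size: int) -> List[int]:
    """Count the boxes covering each unit position along one axis (0 = x, 1 = y)."""
    profile = [0] * size
    for box in boxes:
        for pos in range(max(box[axis], 0), min(box[axis + 2], size)):
            profile[pos] += 1
    return profile

def find_gutters(profile: List[int], min_width: int) -> List[Tuple[int, int]]:
    """Return [start, end) runs of zero density that are at least min_width wide.

    Runs touching either edge of the profile are treated as page margins and skipped.
    """
    gutters, start = [], None
    for pos, density in enumerate(profile + [1]):  # sentinel ends a trailing run
        if density == 0 and start is None:
            start = pos
        elif density != 0 and start is not None:
            if start > 0 and pos < len(profile) and pos - start >= min_width:
                gutters.append((start, pos))
            start = None
    return gutters

# Hypothetical two-column page, 100 units wide, two text lines per column.
words = [(10, 0, 40, 10), (60, 0, 90, 10), (10, 12, 40, 22), (60, 12, 90, 22)]
x_profile = projection_profile(words, axis=0, size=100)
column_gutters = find_gutters(x_profile, min_width=10)  # the gap between the columns
```

The same two functions applied with `axis=1` and a smaller `min_width` would find line spacing, which is one way to sample word, line, and column gaps from a single representation.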
Text Extraction Approach
    Recursively subdivide page into text zones
    Use heuristics to iteratively scan text in each zone
    Re-synthesize likely reading-order text per zone
    Identify and remove semantic noise (noise artifacts)
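Recursive subdivision into zones can be sketched with the classic recursive X-Y cut, a standard layout-analysis technique; the slides do not say which algorithm the platform uses, so this is a stand-in. The box format, the `min_gap` value, and the preference for horizontal cuts before vertical ones are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with y growing downward (assumed)

def widest_gap(boxes: List[Box], axis: int, min_gap: int) -> Optional[int]:
    """Return the midpoint of the widest empty interval along an axis, or None."""
    spans = sorted((b[axis], b[axis + 2]) for b in boxes)
    best, cut = min_gap, None
    reach = spans[0][1]  # furthest extent covered so far
    for lo, hi in spans[1:]:
        if lo - reach >= best:
            best, cut = lo - reach, (lo + reach) // 2
        reach = max(reach, hi)
    return cut

def split_zones(boxes: List[Box], min_gap: int = 8) -> List[List[Box]]:
    """Recursively cut the page into zones, returned in a likely reading order."""
    if not boxes:
        return []
    for axis in (1, 0):  # try a horizontal cut (y axis) first, then a vertical gutter
        cut = widest_gap(boxes, axis, min_gap)
        if cut is not None:
            first = [b for b in boxes if b[axis + 2] <= cut]
            second = [b for b in boxes if b[axis] >= cut]
            return split_zones(first, min_gap) + split_zones(second, min_gap)
    # No gap wide enough: a leaf zone, read top to bottom, then left to right.
    return [sorted(boxes, key=lambda b: (b[1], b[0]))]

# Hypothetical page: a full-width title above two columns of two lines each.
title = (10, 0, 90, 10)
left = [(10, 30, 45, 40), (10, 42, 45, 52)]
right = [(55, 30, 90, 40), (55, 42, 90, 52)]
zones = split_zones([right[1], left[0], title, right[0], left[1]])
```

Here the title becomes its own zone, followed by the left column and then the right column, which matches the reading order a person would follow. Noise removal (running headers and footers) would be a separate pass over the resulting zones.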
Semantic Analysis Techniques
    Determine surface Aboutness(X) for document
    Normalize, find keywords and N-grams
    Perform statistical analysis:
        TF-IDF analysis on keywords/N-grams
        Term distribution analysis
        Rank terms by frequency, section weights, and term cluster position
    Categorize document by topics and concepts
    Summarize document
    Generate and submit terms to query the ad aggregator's inventory
Keyword and N-gram Analysis
    Text of source document
    Normalize pluralization, tense (sky, skies, etc.)
    Stopword filtering: remove trivial stopwords (the, a, etc.)
    Stemming (e.g., Porter, Krovetz)
    Term N-gram extraction: find term N-grams (British Columbia, relational database, etc.)
    Term frequency analysis
    Term distribution analysis
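The pipeline on this slide can be sketched in a few lines. The stopword list and the crude plural folding below are placeholders chosen for the example; a real system would use a full stopword list and a proper stemmer such as Porter or Krovetz, as the slide notes.

```python
import re
from collections import Counter
from typing import List

STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "are"}  # tiny illustrative list

def normalize(token: str) -> str:
    """Crude plural folding (skies -> sky); a stand-in for a real stemmer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def tokenize(text: str) -> List[str]:
    """Lowercase, keep alphabetic runs only, and normalize each token."""
    return [normalize(t) for t in re.findall(r"[a-z]+", text.lower())]

def term_counts(tokens: List[str], max_n: int = 2) -> Counter:
    """Count 1..max_n grams, skipping any gram that starts or ends with a stopword."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return counts

text = "The skies of British Columbia are clear. British Columbia is in Canada."
counts = term_counts(tokenize(text))
```

Because the tokenizer ignores punctuation, this sketch also counts grams that span sentence boundaries (here, "clear british"); segmenting into sentences first would avoid that.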
Basic Keyword and Key Term Analysis
    Term frequency analysis
        [Figure: term frequency (TF) count per term, e.g., Term(i), Term(j)]
        Use TF-IDF for surface analysis of semantics (document level + page level)
    Term distribution analysis
        [Figure: for each term, min(pos), avg(pos), and max(pos), with a 2 S.D. band around the average position of Term(i) and Term(j)]
        Use N-gram distribution analysis to find the topic center of gravity
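Both measures on this slide are easy to state in code. The sketch below uses the textbook TF-IDF form, tf × log(N / df), and summarizes a term's distribution by its mean relative position with a band of two standard deviations. The slides mention extensions to TF-IDF without detailing them, so the exact weighting here is an assumption.

```python
import math
from statistics import mean, pstdev
from typing import List, Optional, Tuple

def tf_idf(term: str, doc: List[str], corpus: List[List[str]]) -> float:
    """Textbook TF-IDF: relative term frequency times log(N / document frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for other in corpus if term in other)
    return tf * math.log(len(corpus) / df) if df else 0.0

def position_spread(term: str, doc: List[str]) -> Optional[Tuple[float, float, float]]:
    """Return (avg position, low, high), with low/high 2 S.D. either side of the average.

    Positions are fractions of the document length, clamped to [0, 1].
    """
    positions = [i / len(doc) for i, tok in enumerate(doc) if tok == term]
    if not positions:
        return None
    center = mean(positions)
    spread = 2 * pstdev(positions)
    return center, max(0.0, center - spread), min(1.0, center + spread)

# Hypothetical tokenized corpus of three short documents.
doc = ["canada", "travel", "canada", "guide", "tourism", "canada"]
corpus = [doc, ["database", "guide"], ["travel", "database"]]
canada_score = tf_idf("canada", doc, corpus)
guide_score = tf_idf("guide", doc, corpus)
canada_spread = position_spread("canada", doc)
```

A term with a high score and a narrow band is concentrated in one part of the document, while a wide band suggests a theme that runs throughout; the average position is one reading of the slide's "center of gravity".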
Semantic Analysis Techniques
    How well does statistical document aboutness work?
        Reasonable results in many cases, but...
    Problems:
        Semantic model based on term strength and co-occurrence
        Sensitive to writing styles that skew N-gram distributions
        Poor selectivity for multi-topic documents
    Need:
        Semantic model of content (e.g., weighted topic tree)
        Logic-based inferencing using key topics
        Mechanism to weigh statistical and symbolic semantics
Build Semantic Model of Concepts
    Goal: Construct concept/topic graph for document
    How: Use document categorization analysis methods to build topic hierarchy
        Leverage term statistics to identify strongest topics
        Leverage external taxonomy/thesaurus/ontologies
        Use topic supertypes for generalization
        E.g.: soccer → field game → outdoor game → sport
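Generalization through supertypes amounts to walking "broader term" links and crediting each ancestor. The sketch below uses the soccer chain from the slide; the `rugby` entry, the term weights, and the halving `decay` factor are hypothetical values added for illustration.

```python
from typing import Dict, List

# Broader-term links. The soccer chain is from the slide; "rugby" is a hypothetical addition.
BROADER: Dict[str, str] = {
    "soccer": "field game",
    "rugby": "field game",
    "field game": "outdoor game",
    "outdoor game": "sport",
}

def supertypes(term: str, broader: Dict[str, str]) -> List[str]:
    """Follow broader-term links upward, stopping at the top or on a cycle."""
    chain, seen = [], {term}
    while term in broader:
        term = broader[term]
        if term in seen:
            break
        seen.add(term)
        chain.append(term)
    return chain

def generalize(term_weights: Dict[str, float], broader: Dict[str, str],
               decay: float = 0.5) -> Dict[str, float]:
    """Add each term's weight to its supertypes, scaled by decay per step up."""
    scores = dict(term_weights)
    for term, weight in term_weights.items():
        for depth, parent in enumerate(supertypes(term, broader), start=1):
            scores[parent] = scores.get(parent, 0.0) + weight * decay ** depth
    return scores

chain = supertypes("soccer", BROADER)
scores = generalize({"soccer": 4.0, "rugby": 2.0}, BROADER)
```

With these inputs, two specific terms that never co-occur can still reinforce the shared topic "field game", which is the effect the slide is after.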
How does an Ontology work?
    Use standardized term relationships
        Class generalization / specialization
        Instance generalization / specialization
        Class relationships
    Enables upper TM platform layers, e.g., semantic analysis
    Relation key:
        NT    Narrower Term
        BT    Broader Term
        SN    Synonym
        RT    Related Term
        UF    Use For
        TT    Top Term
    Ontology thesauri example
        [Figure: thesaurus graph over the terms Agriculture Products, Produce, Fruits, Vegetables, Herbaceous plants, Apples, Pears, and Carrots, with links labeled NT, BT, RT, and UF; UF links a term to a non-preferred term]
Example document: Travel guide for Canada
    PDF: 1 to 1000 pages, average = 5 pages, multiple subtopics
    Well-written text, HS to college-level English
    Well-structured topically
    Domain terminology
Document Topic/Concept Extraction
    Inputs: section weights, term frequencies & distributions
    Text stream filter: tokenizers, stopword filters, term stemmers, sentence segmenter
    Topic / concept extractor
        Ontology manager: taxonomy / thesaurus inferencing
        Topic / concept weighting: scoring rules
    Document concept taxonomy:
         0.0  geography
         0.0      physical geography
         0.0          bodies of water
         2.0              oceans
         0.0          land forms
         5.0              mountains
         0.0      political geography
         0.0          North America
         5.0              United States
        20.0              Canada
         4.0                  Alberta
        11.0                  Newfoundland
         0.0  culture & society
         0.0      leisure & recreation
         4.0          vacations
         0.0      arts & entertainment
         0.0          broadcast media
         3.0              television
         0.0  technology & sciences
         0.0      social sciences
         4.0          history
         0.0  transportation
         0.0      travel industry
        14.0          tourism
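A weighted topic tree like the one on this slide can be held in a small recursive structure, with subtree totals showing how strongly a whole branch is supported. The sketch uses the political geography branch and its scores from the slide; the exact nesting is inferred from the topic names, and the `Topic` class and helper functions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Topic:
    name: str
    weight: float = 0.0
    children: List["Topic"] = field(default_factory=list)

def subtree_weight(topic: Topic) -> float:
    """A topic's own weight plus the weight of everything beneath it."""
    return topic.weight + sum(subtree_weight(child) for child in topic.children)

def strongest(topic: Topic, limit: int = 3) -> List[str]:
    """Names of the highest-weighted topics in the tree, ignoring zero-weight nodes."""
    flat = []
    def walk(node: Topic) -> None:
        flat.append((node.weight, node.name))
        for child in node.children:
            walk(child)
    walk(topic)
    return [name for weight, name in sorted(flat, reverse=True)[:limit] if weight > 0]

# Scores taken from the slide; the nesting is an inferred reading of the diagram.
canada = Topic("Canada", 20.0, [Topic("Alberta", 4.0), Topic("Newfoundland", 11.0)])
tree = Topic("political geography", 0.0,
             [Topic("North America", 0.0, [Topic("United States", 5.0), canada])])
top_topics = strongest(tree)
```

Comparing a node's own weight with its subtree total is one way to decide between recommending a specific topic (Newfoundland) and its generalization (Canada).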
Semantic Analysis Techniques
    Observation: still other valuable concepts present
    Use document summarization analysis methods
    Goal: Capture key statement semantics via sampling
        Leverage topics/concepts to identify best sentences to extract into summary
        Leverage external taxonomy / thesaurus / ontology
        Find terms that support more general topics/concepts
            E.g.: mention of sightseeing supports tourism theme
            E.g.: mention of British Columbia supports Canada theme
Document Summarization
    Inputs: section weights, term frequencies & distributions
    Text stream filter: tokenizers, stopword filters, term stemmers, sentence segmenter
    Topic / concept based sentence extractor
        Ontology manager: topic-based sentence selection
        Sentence weighting: weighting rules
    Pick up co-occurring terms
    Sample text from the example document:
        This will take you to our Virtual Canada Book web site to view or download video clips.
        They are linked to our website.
        Inquiries about this ebook should be sent to info@bcpictures.com.
        Virtual Canada Contents
        Introduction to Canada: A country of many colors.
        British Columbia and Vancouver Island: for the most scenic of mountain panoramas.
        The Government Offices. This site provides information on federal programs and services, departments and agencies.
        VIA Rail Canada: VIA operates trains in all regions of Canada over a network spanning the country from the Atlantic to the Pacific.
        Greyhound Canada: Coach service to nearly 1,100 towns and cities in Canada, as well as the United States.
        Visit the Canadian Automobile Association: CAA offices across the entire country.
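Topic-based sentence selection can be sketched as extractive summarization: score each sentence by the topic terms it mentions and keep the best ones in document order. The sentences below are from the slide; the topic weights, the substring matching, and the function names are simplifying assumptions, and the real extractor also applies section weights and ontology lookups.

```python
import re
from typing import Dict, List

def split_sentences(text: str) -> List[str]:
    """Naive segmenter: split after ., !, or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score_sentence(sentence: str, topic_weights: Dict[str, float]) -> float:
    """Sum the weights of all topic terms that appear in the sentence."""
    lowered = sentence.lower()
    return sum(weight for term, weight in topic_weights.items() if term in lowered)

def summarize(text: str, topic_weights: Dict[str, float], max_sentences: int = 2) -> List[str]:
    """Keep the top-scoring sentences, returned in their original order."""
    sentences = split_sentences(text)
    scores = [score_sentence(s, topic_weights) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(i for i in ranked[:max_sentences] if scores[i] > 0)
    return [sentences[i] for i in keep]

# Hypothetical weights; "british columbia" inherits the Canada weight because it supports that theme.
weights = {"canada": 20.0, "british columbia": 20.0, "mountain": 5.0, "tourism": 14.0}
text = (
    "They are linked to our website. "
    "Inquiries about this ebook should be sent to info@bcpictures.com. "
    "British Columbia and Vancouver Island: for the most scenic of mountain panoramas. "
    "VIA operates trains in all regions of Canada over a network spanning the country "
    "from the Atlantic to the Pacific."
)
summary = summarize(text, weights)
```

The two boilerplate sentences score zero and drop out, while the sentences that mention Canada-related terms survive, which is the noise-filtering effect the slide illustrates.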
Weighing statistical & semantic approaches
    Input document (text of source document)
    Text stream filter: tokenizers, stopword filters, term stemmers, sentence segmenter
    Analysis methods, each with its own weight (ω0, ω1, ω2, ..., ωi, ..., ωn):
        Statistical keywords (TF-IDF)
        Statistical N-grams
        Topic / concept extraction
        Summarization
        Ontology-based inferencing
    Output: document essence
    Evaluation: relevance, inventory match, monetization, human evaluation
    Goal: Infer and approximate conceptual and intentional semantics of content
    Difficulties:
        No explicit ground truth!
        Lots of parameters & weights
        Difficult to tune & stabilize
        Changes will break things
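One simple reading of the ω weights is a weighted sum over normalized analyzer outputs. The slides do not specify how scores are combined, so the linear form, the max-normalization, and the analyzer names below are assumptions for illustration.

```python
from typing import Dict, List, Tuple

def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale an analyzer's scores so its strongest term is 1.0."""
    top = max(scores.values(), default=0.0)
    return {term: s / top for term, s in scores.items()} if top > 0 else {}

def combine(analyzer_scores: Dict[str, Dict[str, float]],
            weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Weighted sum of normalized per-analyzer scores, strongest term first."""
    essence: Dict[str, float] = {}
    for analyzer, scores in analyzer_scores.items():
        omega = weights.get(analyzer, 0.0)  # analyzers without a weight are ignored
        for term, score in normalize(scores).items():
            essence[term] = essence.get(term, 0.0) + omega * score
    return sorted(essence.items(), key=lambda item: item[1], reverse=True)

# Hypothetical outputs from two analyzers on very different scales.
analyzer_scores = {
    "tfidf": {"canada": 0.5, "rail": 0.25},
    "topics": {"canada": 20.0, "tourism": 10.0},
}
ranked = combine(analyzer_scores, {"tfidf": 0.4, "topics": 0.6})
```

Normalizing first keeps one analyzer's scale from dominating, but it does not remove the tuning problem the slide describes: without ground truth, each ω still has to be set by evaluation against relevance or revenue.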
Architecture for a semantic analysis platform
    Framework for modular semantic analysis workflows (similar platforms: e.g., IBM UIMA)
    Use Adobe proprietary and 3rd-party semantic services
    One interchange format for all semantic metadata
    Open language, server, database architecture
        C/C++, Java, PHP, Python
        Apache, Tomcat
        Oracle, SQLite, and JDBC-accessible databases
    Services orchestrated by WF engine
Adobe content intelligence platform
    1. Input document (content input)
        Documents, upload interface, tools & utilities, CMS adapters, crawlers
    2. Extract, structure, & create text (text extraction)
        Layout extraction, page/section segmentation, text extraction, text glyph filtering, stopword filtering, term stemming
    3. Create semantic metadata & tags (metadata generation)
        Keyterm entity extractor, categorizer & theme analyzer, summarizers, other extractors & analyzers
    4. Normalize & persist metadata (metadata persistence)
        XMP metadata services, metadata persistence services, metadata repository (XML)
    5. Retrieve, filter, and analyze all metadata (semantic analysis)
        Category & summary filters, category taxonomy rule engine, Adobe keyterm ranker
    6. Score metadata & create essence (essence generation)
        Weight categories & themes, recommend rule-based categories, recommend doc & page keyterms
    Taxonomies & ontologies
        Commercial & open source taxonomies, domain taxonomies, generic taxonomies, taxonomy & ontology builder
Semantic analysis processing node i
    Job queue: PDF file1, PDF file2, PDF file3, PDF file4, PDF file5, PDF file6, ...
    Each semantic analysis workflow = 1 thread; 10 analysis threads per process
        Doc. Reg. process 01 / svr process 01: semantic analysis WF 01 to WF 10
        Doc. Reg. process 02 / svr process 02: semantic analysis WF 11 to WF 20
        Doc. Reg. process 03 / svr process 03: semantic analysis WF 21 to WF 30
        Doc. Reg. process 04 / svr process 04: semantic analysis WF 31 to WF 40
    Each workflow runs the platform pipeline:
        Content input: documents, upload interface, tools & utilities, CMS adapters, crawlers
        Text extraction: layout extraction, page/section segmentation, text extraction, text glyph filtering, stopword filtering, term stemming
        Metadata generation: keyterm entity extractor, categorizer & theme analyzer, summarizers, other extractors & analyzers
        Metadata persistence: XMP metadata services, metadata persistence services, metadata repository (XML)
        Semantic analysis: category & summary filters, category taxonomy rule engine, Adobe keyterm ranker
        Essence generation: weight categories & themes, recommend rule-based categories, recommend doc & page keyterms
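The one-workflow-per-thread layout can be sketched with a thread pool draining a job queue. This models a single server process only; the `analyze` placeholder and the file names are hypothetical, and the slides do not say how the platform's own processes are implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

THREADS_PER_PROCESS = 10  # from the slide: 10 analysis threads per server process

def analyze(pdf_name: str) -> Dict[str, str]:
    """Placeholder workflow; a real one would extract text, tag, persist, and score."""
    return {"file": pdf_name, "status": "analyzed"}

def run_server_process(job_queue: List[str]) -> List[Dict[str, str]]:
    """Run each queued document through its own workflow thread, keeping queue order."""
    with ThreadPoolExecutor(max_workers=THREADS_PER_PROCESS) as pool:
        return list(pool.map(analyze, job_queue))

jobs = [f"file{i}.pdf" for i in range(1, 7)]
results = run_server_process(jobs)
```

Scaling out to the four processes shown on the slide would mean running several such pools, each pulling from a shared queue, so that a node handles up to 40 workflows at once.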
Screenshot Demo
    Ads for Adobe PDF, powered by Yahoo!
        Hosted in Adobe co-location
        Launched public beta Q4 2007
        60+ publishers participating
    System workflows:
        User registration
        Semantic analysis
        PDF interaction
Marketing Website @ Adobe Labs http://labs.adobe.com/technologies/adsforpdf
Login to Adobe Portal
Proceed to Adobe Portal
Publish the PDF
    Adobe semantic metadata used to match against ad inventory
    On ad click, ad network provider, Adobe, and content publisher share ad revenue
Summary
    Launched new semantic service: Ads for Adobe PDF
    Features in 1.1:
        Page-level analysis for page-specific ads
        High-volume registration and analysis scalability: publishers with millions of PDFs
    Adobe content intelligence platform using:
        Semantic model of content
        Multi-level semantic analysis
    Allows publishers to easily monetize content
    Combines:
        Statistical keyword analysis
        Document topic analysis and summarization
        Ontology and rules-based inferencing
Lessons Learned in 1.0
    Need to use a hybrid semantic analysis approach:
        Statistical methods based on N-grams (TF-IDF)
        Ontologies are key: machine learning and automatic construction
        Symbolic theme/topic inference engine
        Logic rule engines to deal with intentional semantics
    Document topic analysis problem: long documents, multiple topics
        Aboutness(X) with generalization
        Segmentation: need to refine approach to topic segmentation (e.g., Hearst)
    Plan for ground-truth evaluations
        Large number of tuning points
        Use systematic (WF-wide) analysis tracing & logging
    Understand ad network inventory from provider
        Adapt to non-linear ad network behavior (revenue vs. relevance)
Future Direction and Trends
    Need for deeper semantic analysis of text
        Large-scale computational linguistics
        Use broader knowledge base, e.g., Wikipedia, Google, the Web
    Automatic targeted ontology learning
        New vocabulary and topics
        Topic interrelationships
    User preference model based on:
        Fine-grained model of content corpus
        Global user behavior
    Extensions to other media types: audio and video
        Speech-to-text
        Scene analysis, image/object identification