Word Taxonomy for On-line Visual Asset Management and Mining



Similar documents
MINING CLICKSTREAM-BASED DATA CUBES

Issues for On-Line Analytical Mining of Data Warehouses. Jiawei Han, Sonny H.S. Chee and Jenny Y. Chiang

II. OLAP(ONLINE ANALYTICAL PROCESSING)

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April ISSN

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

ISSN: A Review: Image Retrieval Using Web Multimedia Mining

A Way to Understand Various Patterns of Data Mining Techniques for Selected Domains

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Universal. Event. Product. Computer. 1 warehouse.

Data Warehousing and OLAP Technology for Knowledge Discovery

John R. Vacca INSIDE

Text Mining: The state of the art and the challenges

Relational Databases and Data Warehouses æ. Jiawei Han Jenny Y. Chiang Sonny Chee Jianping Chen Qing Chen

Research Issues in Web Data Mining

Information Management course

SPATIAL DATA CLASSIFICATION AND DATA MINING

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

Text Mining - Scope and Applications

Application of Data Mining Techniques in Intrusion Detection

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

Mining various patterns in sequential data in an SQL-like manner *

CubeView: A System for Traffic Data Visualization

Chapter ML:XI. XI. Cluster Analysis

COURSE SYLLABUS COURSE TITLE:

College information system research based on data mining

Database Marketing, Business Intelligence and Knowledge Discovery

Web Content Mining Techniques: A Survey

WEB SITE OPTIMIZATION THROUGH MINING USER NAVIGATIONAL PATTERNS

Web Mining Techniques in E-Commerce Applications

Analyzing Polls and News Headlines Using Business Intelligence Techniques

3/17/2009. Knowledge Management BIKM eclassifier Integrated BIKM Tools

Clustering Technique in Data Mining for Text Documents

ISSN: ISO 9001:2008 Certified International Journal of Engineering Science and Innovative Technology (IJESIT) Volume 2, Issue 4, July 2013

INTEROPERABILITY IN DATA WAREHOUSES

Delivering Smart Answers!

Dr. Anuradha et al. / International Journal on Computer Science and Engineering (IJCSE)

Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

DATA WAREHOUSING AND OLAP TECHNOLOGY

Introduction. A. Bellaachia Page: 1

Data Warehousing and Data Mining

Spam Detection Using Customized SimHash Function

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER

Building a Database to Predict Customer Needs

Multimedia Data Mining: A Survey

(b) How data mining is different from knowledge discovery in databases (KDD)? Explain.

Designing an Object Relational Data Warehousing System: Project ORDAWA * (Extended Abstract)

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Introduction. Introduction. Spatial Data Mining: Definition WHAT S THE DIFFERENCE?

Mining Text Data: An Introduction

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction

Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations

A Critical Review of Data Warehouse

Chapter 3 Data Warehouse - technological growth

A SURVEY ON WEB MINING TOOLS

Application of Data Mining Methods in Health Care Databases

A COGNITIVE APPROACH IN PATTERN ANALYSIS TOOLS AND TECHNIQUES USING WEB USAGE MINING

A Framework of User-Driven Data Analytics in the Cloud for Course Management

A Survey on Preprocessing of Web Log File in Web Usage Mining to Improve the Quality of Data

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Wee Keong Ng. Web Data Management. A Warehouse Approach. With 106 Illustrations. Springer

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Inner Classification of Clusters for Online News

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Alejandro Vaisman Esteban Zimanyi. Data. Warehouse. Systems. Design and Implementation. ^ Springer

Dynamic Data in terms of Data Mining Streams

Caravela: Semantic Content Management with Automatic Information Integration and Categorization (System Description)

Data Mining for Web-Enabled Electronic Business Applications. Richi Nayak

Technology in Action. Alan Evans Kendall Martin Mary Anne Poatsy. Eleventh Edition. Copyright 2015 Pearson Education, Inc.

Spatial Data Warehouse and Mining. Rajiv Gandhi

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

A Hybrid Approach for Ontology Integration

User Profile Refinement using explicit User Interest Modeling

The basic data mining algorithms introduced may be enhanced in a number of ways.

Understanding Web personalization with Web Usage Mining and its Application: Recommender System

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Use of Data Mining in the field of Library and Information Science : An Overview

Data mining in the e-learning domain

Visualization methods for patent data

OLAP Online Privacy Control

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

A LANGUAGE INDEPENDENT WEB DATA EXTRACTION USING VISION BASED PAGE SEGMENTATION ALGORITHM

How To Use Text Mining To Expand Business Intelligence

Visual Data Mining in Indian Election System

Sanjeev Kumar. contribute

Web Mining Functions in an Academic Search Application

How To Find Out What A Web Log Data Is Worth On A Blog

MULTIMEDIA MINING RESEARCH AN OVERVIEW

Data Mining and Analysis of Online Social Networks

Analysis of Data Mining Concepts in Higher Education with Needs to Najran University

A Survey on Web Mining From Web Server Log

CHAPTER 1 INTRODUCTION

Indexing Techniques for Data Warehouses Queries. Abstract

Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization

CONCEPTCLASSIFIER FOR SHAREPOINT

Transcription:

Word Taxonomy for On-line Visual Asset Management and Mining Osmar R. Zaïane * Eli Hagen ** Jiawei Han ** * Department of Computing Science, University of Alberta, Canada, zaiane@cs.uaberta.ca ** School of Computing Science, Simon Fraser University, Canada, {hagen, han}@cs.sfu.ca Abstract We have designed and implemented MultiMediaMiner, a system prototype to mine high-level multimedia information and knowledge from large multimedia repositories like the WWW. WordNet, a semantic network for English, was used to clean and transform sets of keywords extracted from Web pages to index multimedia objects contained in these pages. WordNet was also enriched and used to generate concept hierarchies necessary for interactive information retrieval and the construction of multi-dimensional data cubes for multimedia data mining with MultiMediaMiner. 1.Introduction Data mining, the process of extracting implicit useful knowledge from large databases, has attracted the interest of many researchers in many fields. Even though data mining is well defined [6], the scope of mining and the range of information discovered are relative to the application and the type of data mined. Commonly, data mining tends to mine high-level information to extract rules for characterization, classification, clustering, association, etc. When using concept hierarchies it is possible to drill down along the hierarchy from a high level concept down to the raw data in the database. This process, if done intelligently, can also lead to interactive information retrieval. The data mining we propose is on large multimedia databases. In our implementation we chose the Internet as the source of images and video sequences. Due to its unstructured and dynamic nature, the Internet is not well suited for conventional data mining techniques. In order to constrain the semi-structured data by a schema, we favoured building a database of metadata describing multimedia objects on the Internet and mining the multimedia metadata database. Keeping links to

the original images and videos allows us to go back to the resources and thus accomplish resource discovery along with knowledge discovery. Along with the visual clues extracted from the images, we wanted to automatically associate keywords to the images. These keywords were extracted from the web pages. Research in data mining is flourishing and becoming widespread. Rapid progress is being reported in many areas and applications of data mining. However, despite the fact that multimedia has been the focus of many researchers for indexing, modeling, storing, etc., nothing substantial, in comparison to the relevance of the field, has been done in multimedia data mining, whereas there has been interesting research in text mining from textual documents [3,4] and Web or semistructured data querying and mining [12,8,1,10,5]. Concept hierarchies are frequently used in data mining and are often the central data structure. However, these concept hierarchies are built manually, making the process tedious and domain limited. In our implementation, we do not claim to have fully automated the construction of the concept hierarchy, but our algorithm builds a hierarchy rich enough that can be improved by domain experts. Any knowledge discovery process involves data collection, data cleaning and transformation, data normalization and representation, the mining itself, and discovered knowledge representation and visualization. For lack of space, in this abstract we focus solely on the data cleaning and transformation and data representation of the keywords extracted. 2.System Overview The MultiMediaMiner system [13,14] is a data mining system used to discover high-level information and knowledge from multimedia databases, like summarization of multimedia characteristics, classification of multimedia objects, and association between object features. It is a hybrid system borrowing technologies from DBMiner [7], a system for data mining in large relational databases and data warehouses, and C-BIRD [9], a system for Content-Based Image Retrieval from Digital libraries. While DBMiner applies multi-dimensional database structures, statistical data analysis, and machine learning approaches for mining different kinds of rules in relational databases and data warehouses, C-BIRD pre-processes images and videos to extract relevant features (colours, textures, layout, etc.), and matches queries with image and video features in the database it creates. Images and videos are extracted from Web pages and pre-processed to extract visual features. The Image Excavator [9], a web crawler especially built to traverse the Internet and extract images and videos, retrieves keywords from web pages that contain the images

and associates them with the images as linguistic descriptions. Instead of using all words in web pages as keywords, contextual information, like HTML tags in web pages, are used to derive these keywords. For example, image file name and path if it contains a word or recognizable words, ALT field in the IMG tag, HTML page title, HTML page headers, parent HTML page title, hyperlink to the image from parent HTML page, and neighbouring text before and after the image, META tag placed in the HEAD element of the HTML page, can disclose valuable keywords related to an image. Headers and titles in web pages are rarely complete phrases, and file names and directory paths (which are also used for keywords) are often truncated words, acronyms or created words (example Elvis2, Int8086, redcar, etc.). To solve the problem we decided to use a dictionary to identify legitimate words after simple transformation. The transformations concerns especially compound words like blue angels, and collated words like /images/airplane in a URL, or fighter1.gif in an image file name. After first transformation, a morphological analysis reduces words and verbs to their canonical form (base, infinitive, singular, etc.). The list of such transformed words is then cleaned to eliminate illicit or unrecognized words. A concept hierarchy of keywords is a directed acyclic graph of concepts. An arc from a concept A to a concept B indicates that A subsumes B denoting that A is more general than B. For example Boeing 747 is subsumed by commercial airplane which is subsumed by airplane. Building a concept hierarchy on keywords is not a trivial task. Most applications that use such a structure build them manually. To address both issues, cleaning keywords and building hierarchies automatically, we opted to use the on-line dictionary, thesaurus, and semantic network WordNet developed in the Linguistics department at the University of Princeton [11,2]. In a first stage, WordNet is used as a filter to eliminate unknown words. In a second stage, WordNet is consulted, as a partial order of words, to get subsumption relationships (i.e. child-parent) between words in order to build our concept hierarchy. Unfortunately, WordNet lacks some domain related words. In our first experiment, for instance, we were indexing airplane related web pages. Words like Boeing 747 or F-15 do not exist in WordNet and we had to enrich the semantic network in order to add valid words that were rejected in the first round. All terms that are filtered out and rejected by WordNet are checked by a domain expert and are added to a secondary semantic network that is anchored to WordNet. For example, Boeing 747 which is not recognized in the first stage is added to the secondary hierarchy under a new concept commercial airplane which anchored as a descendent to the concept airplane in WordNet. The secondary (domain related) semantic network becomes an integral part of the enriched WordNet and is used in the keyword cleaning and concept hierarchy

building. Figure 2 shows the integration of WordNet in the design of a text cleaning and concept building module. In Figure 1, thumbnails of commercial airplanes pertaining to the aircraft manufacturer Boeing are displayed. This user interface from MultiMediaMiner also allows the selection of a multimedia data set to be mined. The hierarchy of keywords on the left of Figure 1 is a section of the concept hierarchy automatically generated by visiting some web sites containing aircraft images. The user selects images by browsing the concept hierarchy and choosing a keyword at a given conceptual level, and zooms in to a lower conceptual level or zooms out to a higher conceptual level before retrieving the relevant images as source of the data mining process. The user interface can also be used as an on-line resource discovery mechanism by drilling through to the web pages containing the images. A major challenge is the mere size of a keyword dimension. Ordinarily, without counting the leaf nodes of a hierarchy, the size of concept hierarchies rarely exceeds 50 to 100 nodes in a data warehousing application, but eventually, a keyword hierarchy will reach several thousand nodes even for a restricted domain. It will eventually become impossible to deal with the keyword dimension in a traditional manner since the data cube becomes too sparse and too large to handle in main memory. New algorithms for handling the multi-dimensional data cube should be introduced. A second challenge is the multi-valued nature of the keyword domain. An image may have many keywords associated to it. This makes difficult to add the keyword dimension to the multidimensional data cube since the aggregate values in the data cube are based on one value per dimension per image. For our current implementation, we have left the keyword dimension outside the data cube, limiting our data mining possibilities to only the remaining image attributes. Keywords are only used to select images for analysis. We are working on a new data structure that would allow us to take advantage of the multi-valued nature of the keyword dimension inside the data cubes, and thus be able to include the keywords in our data mining and on-line analytical processing which would be very useful for visual asset management. References [1] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. L. Wiener. The LOREL query language for semistructured data, 1997. http://www.db.stanford.edu/~abitebou/pub/jodl97.lorel96.ps. [2] R. Beckwith, C. Fellbaum, D. Gross, K. Miller, G.A. Miller, and R. Tengi. Five papers on WordNet. Special Issue of Journal of Lexicography, 3(4): 235-312, 1990. Also available from ftp://ftp.cogsci.princeton.edu/pub/wordnet/5papers.ps.

Web Pages Images and Videos Page and Image URLs Image and Video Processing Feature vectors Metadata WordNet Raw Keywords Data Cleaning Normalized Keywords Hierarchy Building Concept Hierarchy Figure 1: Selecting (and browsing) data sets of images using keyword. Figure 2: Use of WordNet for keyword Cleaning and building Concept Hierarchie. [3] R. Feldman and I. Dagan. Knowledge discovery in textual databases (KDT). In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining, pages 112-117, Montreal, Canada, Aug. 1995. [4] R. Feldman and H. Hirsh. Mining associations in text in the presence of background knowledge. In Proc. 2st Int. Conf. Knowledge Discovery and Data Mining, pages 343-346, Portland, Oregon, Aug. 1996. [5] D. Florescu, A. Levy, and A. Mendelzon. Database techniques for the world-wide web: A survey. ACM SIGMOD Record, 27(3):59-74, September 1998. [6] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. Knowledge discovery in databases: An overview. In G. Piatetsky-Shapiro and W. J. Frawley, ed. Knowledge Discovery in Databases, pages 1--27. AAAI/MIT Press, 1991. [7] J. Han, J. Chiang, S. Chee, J. Chen, Q. Chen, S. Cheng, W. Gong, M. Kamber, G. Liu, K. Koperski, Y. Lu, N. Stefanovic, L. Winstone, B. Xia, O. R. Zaïane, S. Zhang, and H. Zhu. DBMiner: A system for data mining in relational databases and data warehouses. In Proc. CASCON'97: Meeting of Minds, pages 249-260, Toronto, Canada, November 1997. [8] D. Konopnicki and O. Shmueli. W3QS: A query system for the world-wide web. In Proc. 21st Int. Conf. on Very Large Data Bases (VLDB), pages 54-65, Zurich,Switzerland, 1995. [9] Z-N. Li, O. R. Zaïane, and Z. Tauber. Illumination invariance and object model in content-based image and video retrieval. Journal of Visual Communication and Image Representation, 1999. [10] A. Mendelzon and T. Milo. Formal models of web queries. In ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, 1997. [11] WordNet - a lexical database for english. http://www.cog\-sci.prince\-ton.edu/~wn/, 1998. [12] O. R. Zaïane and J. Han. Resource and knowledge discovery in global information systems: A preliminary design and experiment. In Proc. First Int. Conf. On Knowledge Discovery and Data Mining, Montreal, Canada, 1995. [13] O. R. Zaïane, J. Han, Z-N. Li, S.H.Chee, and J.Y. Chiang. MultiMediaMiner: A system prototype for multimedia data mining. In SIGMOD'98, 1998. [14] O. R. Zaïane, J. Han, Z-N. Li, and J. Hou. Mining multimedia data. In CASCON'98: Meeting of Minds, Toronto, Canada, November 1998.