Canonical Image Selection for Large-scale Flickr Photos using Hadoop




Guan-Long Wu
National Taiwan University, Taipei
Nov. 10, 2009, @NCHC
Communication and Multimedia Lab (通訊與多媒體實驗室), Department of Computer Science and Information Engineering, NTU (台大資訊系)
http://www.csie.ntu.edu.tw/~b95109
Note: parts of these slides are courtesy of Prof. Winston Hsu and Liang-Chi Hsieh, CMLab, NTU.

Team Members (MiRA group, CMLab, NTU)
- Prof. Winston H. Hsu
- Liang-Chi Hsieh
- Kuan-Ting Chen
- Chien-Hsing Chiang
- Yi-Hsuan Yang
- Guan-Long Wu
- Chun-Sung Ferng
- Hsiu-Wen Hsueh
- Angela Charng-Rurng Tsai

Who am I?
- A senior undergraduate student at NTU CSIE
- Research interests
  - Multimedia (CMLab, NTU. Advisor: Winston H. Hsu)
  - Artificial Intelligence (iagent Lab, NTU. Advisor: Jane Yung-jen Hsu)
  - Bioinformatics (NYMU. Advisor: Yeou-Guang Tsay)
- Contact: b95109@csie.ntu.edu.tw

Outline
- Introduction: context cues in social media
- Efficient image search result clustering
- Demo
- Concept of Hadoop Implementation
  - Image Pairwise Image Similarity
  - Affinity propagation
- Comparing with previous approaches
- Conclusions

Challenges and Opportunities from Large-Scale Social Media
- Flickr (4B+ photos)
- YouTube (25 hrs upload / min)
- Web (10B videos watched / month)
- Digital photos (50B / year)
- Growing practice of online media sharing
- Billion-scale magnitude
- Bringing profound impacts to new applications and user scenarios
- The technologies do not keep pace with the growth, e.g., search, mining, visualization, and other promising applications

Rich Context Cues in Social Media
[Figure: Flickr example annotated with notes (object-level), comments, visual features (color, texture, visual words), concept detectors (tower, sky, building), user-provided tags, time stamp (yy/mm/dd), and geotag]
- Rich textual and visual cues, device metadata, and user interactions for social and organizing purposes
- Geo-locations, time, camera settings (e.g., shutter speed, focal length, flash)
- User-provided tags, descriptions, notes, etc.
- Comments, bookmarks, favorites (subjective)

Social Media Visualization
- Select canonical views to represent a landmark [Kennedy et al., WWW 08]
  - Apply a clustering algorithm (e.g., K-means) to tagged photos
  - Select one image from each cluster (assumed to be visually dissimilar)
- Extremely time-consuming and NOT suitable for online image search result clustering
  - Pair-wise similarity
  - Clustering algorithms

Efficient image search result clustering
- Current: feature: keyword-based; organization: N/A; display: image list
- Proposed: textual and visual-based features; text-based similarity; graph-based clustering; canonical images; semantic image groups; browsing by image groups

Image Pairwise Image Similarity with MapReduce
- Goal: speed up pairwise cosine similarity calculation with MapReduce (Hadoop) over large-scale image collections, where each image is represented by a large vocabulary of visual words (VWs)
- Constructing similarity hyperlinks in image collections for visualization and improving search quality; offline computation
- tf-idf cut is more powerful than df-cut when dealing with VWs.
- 69+ times speed-up over 18 Hadoop nodes with similar performance (MAP) (11K images with 10K VWs)
[Figure: time (sec) and MAP improvement (%) versus thresholds for dropping visual words]

Cloud computing
- Leveraging the MapReduce framework to scale up graph construction
- Computing a huge image graph on an 18-node Hadoop cluster

Dataset      Single machine   Hadoop platform
Flickr11k    1.6 hrs          83 secs
Flickr550k   unknown          42 mins

Offline clustering
- Clustering and canonical (representative) image selection by Hadoop-based Affinity Propagation

On-the-fly image search result clustering
- Real-time image search result clustering by pulling from precomputed clusters
[Figure labels: canonical images; Rome]
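The idea of answering a query by pulling from precomputed clusters can be sketched as a simple lookup. This is only an illustration under stated assumptions: the mapping `exemplar_of`, the image ids, and the choice to order groups by size are all hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical offline output: image id -> id of its canonical (exemplar) image.
exemplar_of = {
    "img1": "img1",
    "img2": "img1",
    "img3": "img1",
    "img7": "img7",
}

def cluster_search_results(result_ids, exemplar_of):
    """Group a ranked result list by precomputed exemplar.

    Images without a precomputed cluster form singleton groups.
    Ordering groups by size is an assumption made for this sketch.
    """
    groups = defaultdict(list)
    for image_id in result_ids:
        groups[exemplar_of.get(image_id, image_id)].append(image_id)
    # sorted() is stable, so equally sized groups keep first-seen order.
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)

# Hypothetical keyword-search result list.
search_results = ["img3", "img1", "img7", "img2", "img9"]
clusters = cluster_search_results(search_results, exemplar_of)
```

No clustering runs at query time in this sketch; the only online work is one dictionary lookup per result, which is what makes the approach suitable for real-time use.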

Demo!
- Canonical images
- Thumbnails and image viewer

Image Pairwise Image Similarity with MapReduce
Indexing phase: vector to inverted index (utilize sparse vectors)

Map:
  d1 -> (F1,(d1,2)) (F2,(d1,8))
  d2 -> (F2,(d2,4))
  d3 -> (F1,(d3,5)) (F2,(d3,1))
Aggregation:
  (F1,[(d1,2),(d3,5)])
  (F2,[(d1,8),(d2,4),(d3,1)])
Reduce:
  (F1,[(d1,2),(d3,5)])
  (F2,[(d1,8),(d2,4),(d3,1)])
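The indexing phase above can be imitated in a few lines of single-process Python. This is a minimal sketch, not the Hadoop job itself: the function names are hypothetical, and a dictionary stands in for the shuffle/aggregation step that Hadoop performs between map and reduce.

```python
from collections import defaultdict

# Sparse visual-word vectors from the slide's example: doc -> {feature: weight}.
docs = {
    "d1": {"F1": 2, "F2": 8},
    "d2": {"F2": 4},
    "d3": {"F1": 5, "F2": 1},
}

def index_map(doc_id, vector):
    """Emit one (feature, (doc_id, weight)) pair per non-zero entry."""
    for feature, weight in vector.items():
        yield feature, (doc_id, weight)

def index_reduce(feature, postings):
    """Turn all postings of one feature into a sorted posting list."""
    return feature, sorted(postings)

def build_inverted_index(docs):
    grouped = defaultdict(list)  # stand-in for the shuffle/aggregation step
    for doc_id, vector in docs.items():
        for feature, posting in index_map(doc_id, vector):
            grouped[feature].append(posting)
    return dict(index_reduce(f, p) for f, p in grouped.items())

inverted_index = build_inverted_index(docs)
```

Only non-zero entries are emitted, which is why sparse vectors keep the index small.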

Image Pairwise Image Similarity with MapReduce
Calculation phase: inverted index to pairwise similarity

Map:
  (F1,[(d1,2),(d3,5)]) -> ((d1,d3),10)
  (F2,[(d1,8),(d2,4),(d3,1)]) -> ((d1,d2),32) ((d1,d3),8) ((d2,d3),4)
Aggregation:
  ((d1,d2),[32])
  ((d1,d3),[10,8])
  ((d2,d3),[4])
Reduce:
  ((d1,d2),32)
  ((d1,d3),18)
  ((d2,d3),4)

Affinity propagation [Frey et al., Science, 07]
- Each data point is either an exemplar (cluster center) or a non-exemplar (any other data point).
- Messages are passed between exemplars (centroids) and non-exemplar data points.
- The algorithm finds the total number of clusters automatically.
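Returning to the calculation phase of the pairwise-similarity job: the sketch below reproduces the slide's numbers in single-process Python. The function names are hypothetical and a dictionary again stands in for Hadoop's aggregation step. Note that the sums are dot products; they equal cosine similarities only if the vectors were L2-normalized beforehand, which is an assumption the slide's raw weights do not confirm.

```python
from collections import defaultdict
from itertools import combinations

# Inverted index from the slide: feature -> [(doc_id, weight), ...].
inverted_index = {
    "F1": [("d1", 2), ("d3", 5)],
    "F2": [("d1", 8), ("d2", 4), ("d3", 1)],
}

def pair_map(feature, postings):
    """Emit one partial product per pair of documents sharing this feature."""
    for (doc_a, w_a), (doc_b, w_b) in combinations(sorted(postings), 2):
        yield (doc_a, doc_b), w_a * w_b

def pair_reduce(pair, partial_products):
    """Sum the partial products of one document pair."""
    return pair, sum(partial_products)

def pairwise_dot_products(index):
    grouped = defaultdict(list)  # stand-in for the shuffle/aggregation step
    for feature, postings in index.items():
        for pair, product in pair_map(feature, postings):
            grouped[pair].append(product)
    return dict(pair_reduce(p, v) for p, v in grouped.items())

similarity = pairwise_dot_products(inverted_index)
```

Document pairs that share no visual word never appear in the output, so the work grows with the length of the posting lists rather than with the square of the collection size. This is also why dropping very common visual words (the thresholds mentioned earlier) shortens the job.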

Hadoop Implementation of Affinity Propagation [Wang et al. ICHL 2008]
- S: similarity s(i, k)
- R: responsibility r(i, k)
- A: availability a(i, k)

Pipeline:
- InitMapper: convert the input file to the required format
- Iteration:
  - SA2RReducer: calculate R from S and A
  - R2AReducer: calculate A from R
- CleanReducer: find exemplars on the image graph

Comparing with previous approaches

Approach                Response time   Feature              Scalability
SRC-based [1]           Fast            Textual only         No
Online clustering [2]   Slow            Visual only          No
Our approach [3]        Faster          Textual and visual   Yes

[1] Feng Jing et al., IGroup: web image search results clustering, ACM MM 2006
[2] Reinier H. van Leuken et al., Visual diversification of image search results, WWW 2009
[3] Hsieh et al., Canonical Image Selection and Efficient Image Graph Construction for Large-Scale Flickr Photos, ACM MM 2009
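The S, R, and A quantities in the Hadoop implementation above correspond to the standard affinity propagation message updates of Frey and Dueck. The sketch below is an in-memory analogue, not the authors' Hadoop code: `update_responsibility` plays the role of SA2RReducer, `update_availability` that of R2AReducer, and the final exemplar pick that of CleanReducer. The damping factor, the iteration count, and the toy 1-D data with preference -10 are assumptions chosen for illustration.

```python
def update_responsibility(S, A):
    """r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            best_other = max(A[i][j] + S[i][j] for j in range(n) if j != k)
            R[i][k] = S[i][k] - best_other
    return R

def update_availability(R):
    """a(i,k) = min(0, r(k,k) + positive support); a(k,k) = positive support."""
    n = len(R)
    A = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            support = sum(max(0.0, R[j][k]) for j in range(n) if j != i and j != k)
            A[i][k] = support if i == k else min(0.0, R[k][k] + support)
    return A

def damp(old, new, damping):
    """Blend old and new messages to avoid oscillation."""
    return [[damping * o + (1.0 - damping) * v for o, v in zip(row_o, row_n)]
            for row_o, row_n in zip(old, new)]

def affinity_propagation(S, iterations=100, damping=0.5):
    """Return, for each point i, the index of its chosen exemplar."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]
    A = [[0.0] * n for _ in range(n)]
    for _ in range(iterations):
        R = damp(R, update_responsibility(S, A), damping)
        A = damp(A, update_availability(R), damping)
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]

# Toy data (assumed): two well-separated groups of 1-D points.
points = [0, 1, 2, 10, 11, 12]
preference = -10.0  # s(k,k); lower values yield fewer exemplars
S = [[preference if i == k else -float((xi - xk) ** 2)
      for k, xk in enumerate(points)]
     for i, xi in enumerate(points)]
exemplars = affinity_propagation(S)
```

Each update reads only one matrix row (responsibility) or one column (availability), which is plausibly what makes the algorithm fit a reducer keyed by row or column. That mapping is an interpretation of the slide, not something it states.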

Conclusions
- The proposed system can organize image search results into semantic clusters at query time.
- This efficiency is achieved with the help of image context graphs computed offline by distributed computing methods.

Acknowledgements
- National Center for High-Performance Computing (NCHC), Taiwan, for the Hadoop platform and technical support in cloud computing