A picture is worth five captions

A picture is worth five captions: Learning visually grounded word and sentence representations. Grzegorz Chrupała (with Ákos Kádár and Afra Alishahi)

Learning word (and phrase) meanings: Cross-situational. Distributional. Example contexts: "the cat sat on the mat"; "the dog chased the cat"; "funniest cat video ever lol".

Distributional: Very popular in CogSci and NLP (LSA, LDA, word2vec). Massive amounts of data. Recent focus on compositionality.

Cross-situational: Synthetic data (Fazly et al. 2010). Coded scene representations (Frank et al. 2009). But natural scenes are not sets of symbols.

Real scenes: Harder: objects need to be identified, invariances detected. But also easier: better opportunities for generalization.

Captioned images: Young et al. 2014, denotational semantics: only use images as opaque ids. Several works on generating captions use actual image features.

Visually grounded word and sentence representations: Learn from linguistic context and (non-symbolic) visual context. Compositionality: word, phrase and sentence representations.

Aligning words and image features: Based on a word learning model for synthetic data (Cognitive Science, 2010).

Feature-word alignment. [Figure: the words of the caption "Joe happily eating an apple" aligned with CNN image features.]

Visual word vectors (4096 dimensions): apple, pear, eat, drink.

Evaluation: Unlike for synthetic data, no ground truth. Indirect evaluation: correlation with human similarity judgments (what exactly do we get when we ask for these judgments?); search images based on captions; generate captions for images; paraphrase captions.
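One way to run the correlation check is a rank correlation between the model's similarity scores and human ratings for the same word pairs. The sketch below assumes Spearman's rank correlation, which the slides do not name, and the function names are hypothetical. Model scores would typically be cosine similarities between word vectors, but here they are passed in as plain numbers.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(model_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(model_scores), average_ranks(human_scores))


# Made-up numbers for illustration only, not results from the talk.
model_scores = [0.9, 0.1, 0.5]   # e.g. model similarity for three word pairs
human_scores = [8.0, 2.0, 5.0]   # e.g. human similarity ratings for the same pairs
rho = spearman(model_scores, human_scores)
```

A rank correlation only asks whether the model orders the pairs the way people do, so the different scales of the two score lists do not matter.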

Correlations with human judgments (Flickr30K)

Predicting ImageNet labels from word representations
Label: aircraft carrier, carrier, flattop. Hypernym: vehicle. Predicted: distant, boats, ship, houses
Label: stove. Hypernym: device. Predicted: fire, candles, taken, lit

But need more: Integrate linguistic and visual context. Representations of phrases and complete sentences. Start from scratch.

IMAGINET: Multi-task language/image model. Neural network model: generality; separation of modeling and learning algorithm; reusable building blocks; successful in a variety of tasks including captioning. But: opaque internal states; need techniques to help interpretability.

[Figure: word embeddings feeding a textual pathway and a visual pathway; example caption "Joe happily eating an apple END"; image features from a CNN.]

Compared to captioning. Captioning (e.g. Vinyals et al. 2014): start with image vector; output caption word-by-word, conditioning on image and seen words. IMAGINET: read caption word-by-word; incrementally build sentence representation while also predicting the coming word; finally, map to image vector.

Compared to compositional distributional semantics
word embeddings: distributional word vectors
hidden states: sentence vectors
input-to-hidden weights: projection to sentence space
hidden-to-hidden weights: composition operator
All these are learned based on the supervision signal from the two tasks.

Some details: Shared word embeddings: 1024 units. Pathways: Gated Recurrent Unit nets, 1024 clipped rectifier units. Image representations: 4096 dimensions. Multi-task objective.

Multi-task objective: L_T: cross-entropy loss (mean negative log probability of next word). L_V: mean squared error. α = 0: purely visual model. α = 1: purely textual model. 0 < α < 1: multi-task model.
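The combined formula did not survive in the transcript. A minimal sketch, assuming the two losses are mixed as a convex combination α·L_T + (1 − α)·L_V, which is consistent with α = 0 giving a purely visual model and α = 1 a purely textual one. The function names are hypothetical.

```python
import math


def cross_entropy(next_word_probs):
    """L_T: mean negative log probability the model gave to each actual next word."""
    return -sum(math.log(p) for p in next_word_probs) / len(next_word_probs)


def mean_squared_error(predicted, target):
    """L_V: mean squared error between predicted and actual image vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def multi_task_loss(alpha, textual_loss, visual_loss):
    """Assumed form of the objective: alpha * L_T + (1 - alpha) * L_V."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * textual_loss + (1.0 - alpha) * visual_loss


# Toy values for illustration only.
l_t = cross_entropy([0.5, 0.25])                    # probabilities of the true next words
l_v = mean_squared_error([1.0, 2.0], [1.0, 4.0])    # predicted vs. target image vector
loss = multi_task_loss(0.25, l_t, l_v)
```

With this form, sweeping `alpha` between 0 and 1 moves smoothly between the purely visual and purely textual models named on the slide.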

Bag-of-words linear regression as a baseline: How much do embeddings and recurrent nets contribute? Baseline input: word-count vector. Output: image vector. L2-penalized sum-of-squared errors regression.
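The baseline's input encoding and objective can be sketched as follows. This is an illustration only: whitespace tokenization, the absence of a bias term, and all function names are assumptions, and the code evaluates the penalized objective without including a solver.

```python
from collections import Counter


def word_count_vector(caption, vocabulary):
    """Bag-of-words input: how often each vocabulary word occurs in the caption."""
    counts = Counter(caption.lower().split())
    return [counts[word] for word in vocabulary]


def predict(weights, x):
    """Linear map from word counts to an image vector.

    weights has one row per image dimension and one column per vocabulary word.
    """
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def ridge_objective(weights, inputs, targets, penalty):
    """Sum of squared errors plus penalty times the squared L2 norm of the weights."""
    squared_error = 0.0
    for x, y in zip(inputs, targets):
        y_hat = predict(weights, x)
        squared_error += sum((a - b) ** 2 for a, b in zip(y_hat, y))
    l2 = sum(w * w for row in weights for w in row)
    return squared_error + penalty * l2


# Toy vocabulary and a 2-dimensional "image vector", for illustration only.
vocabulary = ["a", "cat", "dog"]
x = word_count_vector("a cat chased a dog", vocabulary)
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
objective = ridge_objective(weights, [x], [[2.0, 1.0]], penalty=0.5)
```

Because the input is a count vector, word order is discarded entirely, which is what makes this a useful point of comparison for the recurrent pathways.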

Dataset

Correlations with human judgments

Image retrieval and sentence structure: Project original and scrambled caption to visual space. Rank images according to cosine similarity to caption.
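The retrieval step can be sketched as below. The projection into visual space needs the trained model and is not shown: the caption vector is taken as given. The function names and the toy 2-dimensional vectors are assumptions for illustration.

```python
import math
import random


def scramble(caption, seed=0):
    """Return the caption with its whitespace-separated tokens in random order."""
    words = caption.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images(caption_vector, image_vectors):
    """Return image ids ordered from most to least similar to the caption vector."""
    return sorted(
        image_vectors,
        key=lambda image_id: cosine(caption_vector, image_vectors[image_id]),
        reverse=True,
    )


# Toy image vectors; real ones would be 4096-dimensional CNN features.
image_vectors = {
    "pigeon": [1.0, 0.0],
    "bear": [0.0, 1.0],
    "kitchen": [-1.0, 0.0],
}
projected_caption = [0.9, 0.1]  # stand-in for the model's output for one caption
ranking = rank_images(projected_caption, image_vectors)
shuffled = scramble("a pigeon with red feet perched on a wall .")
```

Running the same ranking on the projection of the scrambled caption and comparing the position of the correct image shows how much the model relies on word order.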

Original: a pigeon with red feet perched on a wall.
Scrambled: feet on wall. pigeon a red with a perched

Original: a brown teddy bear lying on top of a dry grass covered ground
Scrambled: a a of covered laying bear on brown grass top teddy ground. dry

Original: a variety of kitchen utensils hanging from a UNK board.
Scrambled: kitchen of from hanging UNK variety a board utensils a.

Paraphrase retrieval: Record the final state along the visual pathway for a (maybe scrambled) caption. For each caption, rank others according to cosine similarity. Are top-ranked captions about the same image?
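The procedure can be scored with a simple hit rate. A minimal sketch: the final states are taken as given vectors, and the metric name `paraphrase_hit_rate` and the cutoff `k` are assumptions rather than the measure used in the talk.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def paraphrase_hit_rate(states, image_ids, k=1):
    """Fraction of captions whose k nearest captions include one for the same image.

    states[i] is the final visual-pathway state for caption i,
    image_ids[i] identifies the image that caption i describes.
    """
    hits = 0
    for i, query in enumerate(states):
        # Rank every other caption by similarity to the query caption.
        others = [j for j in range(len(states)) if j != i]
        others.sort(key=lambda j: cosine(query, states[j]), reverse=True)
        if any(image_ids[j] == image_ids[i] for j in others[:k]):
            hits += 1
    return hits / len(states)


# Toy 2-dimensional states: two captions per image, for illustration only.
states = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
image_ids = ["baby", "baby", "horses", "horses"]
rate = paraphrase_hit_rate(states, image_ids, k=1)
```

Feeding in states computed from scrambled captions instead of the originals gives the comparison illustrated in the examples that follow.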

Paraphrase retrieval

Original: a cute baby playing with a cell phone
Retrieved: small baby smiling at camera and talking on phone. a smiling baby holding a cell phone up to ear. a little baby with blue eyes talking on a phone.
Scrambled: phone playing cute cell a with baby a
Retrieved: someone is using their phone to send a text or play a game. a camera is placed next to a cellular phone. a person that 's holding a mobile phone device

Original: a couple of horses UNK their head over a rock pile
Retrieved: two brown horses hold their heads above a rocky wall. two horses looking over a short stone wall.
Scrambled: rock couple their head pile a a UNK over of horses
Retrieved: an image of a man on a couple of horses looking in to a straw lined pen of cows

Currently working on: Encourage complete sentence representations along textual pathway. Disentangle relative contribution of: longer-range predictions, caption reconstruction, word embeddings, recurrent state. Controlled manipulation of inputs.

Long term: Character-level input (proof of concept working). Direct audio input. Need a better story on what should be learned from data and what should be hard-coded, or evolved.

Thanks!

Gated recurrent units

Character level: character embeddings: 128 units. GRUs: 1024. Accuracy@5: 15%.

IMAGINET

Retrieving ImageNet pictures
Model       Accuracy@5
Visual      0.38
MultiTask   0.38
LinReg      0.33

Cosine and standardization