Big datasets - promise or Big data, shmig data. D.A. Forsyth, UIUC

Similar documents
Big Databases and Computer Vision D.E.A.A.R.A.F.A.R.A.R.F.

Face Identification by Human and by Computer: Two Sides of the Same Coin, or Not? Tsuhan Chen

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

The Scientific Data Mining Process

Knowledge-based systems and the need for learning

Object Recognition. Selim Aksoy. Bilkent University

Classifying Manipulation Primitives from Visual Data

Data Mining Classification: Decision Trees

Product Review: James F. Koopmann Pine Horse, Inc. Quest Software s Foglight Performance Analysis for Oracle

CS171 Visualization. The Visualization Alphabet: Marks and Channels. Alexander Lex [xkcd]

Machine Learning Final Project Spam Filtering

Introduction. A. Bellaachia Page: 1

Abstraction in Computer Science & Software Engineering: A Pedagogical Perspective

CFSD 21 ST CENTURY SKILL RUBRIC CRITICAL & CREATIVE THINKING

HELP DESK SYSTEMS. Using CaseBased Reasoning

IT Challenges for the Library and Information Studies Sector

Colorado School of Mines Computer Vision Professor William Hoff


MACHINE LEARNING & INTRUSION DETECTION: HYPE OR REALITY?

Errors in Operational Spreadsheets: A Review of the State of the Art

Value and Challenges of Regularised Consultations and Information Sharing between Facility Security Managers 1

IMPROVING SETTLEMENT SAVVY. Kathy Perkins Kathy Perkins LLC, Workplace Law & Mediation

Sources of International Law: An Introduction. Professor Christopher Greenwood

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Perception of Light and Color

Easily Identify Your Best Customers

Visibility optimization for data visualization: A Survey of Issues and Techniques

Tutorial for proteome data analysis using the Perseus software platform

Artificial Intelligence

Appendix C. Sign Design Guidelines. Sign Types and Placement

Paralegal News. What am I feeling? Why am I feeling this way? What can I do about it?..then.what will I do about it?

The Role of Human Resource Management in Risk Management

OVERVIEW OF JPSEARCH: A STANDARD FOR IMAGE SEARCH AND RETRIEVAL

1 What is Machine Learning?

Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks

Quantitative vs. Categorical Data: A Difference Worth Knowing Stephen Few April 2005

Information Management course

Big Data in the Mathematical Sciences

Element Property Definition for the Space Satellite

Experiments in Web Page Classification for Semantic Web

Statistics for BIG data

A Capability Model for Business Analytics: Part 2 Assessing Analytic Capabilities

SigmaRADIUS Leadership Effectiveness Report

Machine Learning and Statistics: What s the Connection?

Get The Picture: Visualizing Financial Data part 1

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Mini-Guide to Selecting and Working with Consultants

Organising g for Transparency. The ranking of Dutch hospital care

Categorical Data Visualization and Clustering Using Subjective Factors

Support Vector Machines with Clustering for Training with Very Large Datasets

Školská sociálna práca 191

Introduction to Data Mining

Centre for Learning and Development

Big Data: Image & Video Analytics

DATA QUALITY IN GIS TERMINOLGY GIS11

Chapter 3.3. Trading Psychology

Total Survey Error & Institutional Research: A case study of the University Experience Survey. Sonia Whiteley The Social Research Centre

Classification Appeal Decision Under Section 5112 of Title 5, United States Code

A Slip, Trip, and Fall Guide For Banks By Scott Simmonds

A workshop for Teachers of Young Children. How Blocks Stack Up. By Sharon MacDonald

Employer Health Insurance Premium Prediction Elliott Lui

Appendix B Data Quality Dimensions

Analyzing survey text: a brief overview

The Critical Skills Students Need



DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support

Neural Networks for Machine Learning. Lecture 13a The ups and downs of backpropagation

Machine Learning: Overview

Project Human Resource Management, PMBOK Forth Edition

Robust Real-Time Face Detection

Using Semantic Data Mining for Classification Improvement and Knowledge Extraction

GEO-VISUALIZATION SUPPORT FOR MULTIDIMENSIONAL CLUSTERING

Fact sheet. Tractable. Automating visual recognition tasks with Artificial Intelligence

Prepared for: Your Company Month/Year

TRAINING NEEDS ANALYSIS

The dental profession s peer review program is a process to resolve disputes between a

Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network

B-bleaching: Agile Overtraining Avoidance in the WiSARD Weightless Neural Classifier

Operator-Centric Design Patterns for Information Visualization Software

Get the most value from your surveys with text analysis

Teaching Methodology for 3D Animation

Blender Notes. Introduction to Digital Modelling and Animation in Design Blender Tutorial - week 9 The Game Engine

IP-S3 HD1. Compact, High-Density 3D Mobile Mapping System

Two big challenges in machine learning LÉON BOTTOU

Be visual all the time, and be verbal when labeling because explanations help clarify ideas.

KPMG Unlocks Hidden Value in Client Information with Smartlogic Semaphore

The Delicate Art of Flower Classification

Taking Inverse Graphics Seriously

Object class recognition using unsupervised scale-invariant learning

Contact Information. Phone

High Resolution RF Analysis: The Benefits of Lidar Terrain & Clutter Datasets

Visualization Quick Guide

T-REDSPEED White paper

Submission IPART - Benchmark Costs for Local Council Infrastructure Contributions

Graphical Representation of Multivariate Data

Sterman Counseling and Assessment

ECLT 5810 E-Commerce Data Mining Techniques - Introduction. Prof. Wai Lam

Sexuality Education Curriculum Audit Tool

Environmental Remote Sensing GEOG 2021

Transcription:

Big datasets - promise or Big data, shmig data D.A. Forsyth, UIUC

Conclusion Q: What do big datasets tell us? A: Not much, if the emphasis is on size Collecting datasets is highly creative rather than a nuisance activity tools are getting better by the day Bias, weird frequencies are a major issue There are no best practices for avoiding problems May shape our representations Recognition problems are hard to frame excess certainty may be dangerous

Bias Should not be perjorative Frequencies in the data may misrepresent the application Because the labels are often wrong Because of what gets labelled eg pedestrians in crowds P(labelled X) is not uniform eg obscure but important objects in complex clutter Because of what gets collected eg. pictures from the web are selected - not like a camera on head eg. Profession labelling for faces in news pictures Label error Label bias Curation bias X=data

Induction is why bias matters Fundamental principle of machine learning if the world is like the dataset, then future performance will be like training Chernoff bounds, VC dimension, etc., etc. But what if the world can t be like the dataset?

Pedestrian Detection Pedestrian detection: We may not run down people who behave strangely want will fail to detect with frequency... can do... IF test set is like training set There is a large weight of easy cases which may conceal hard cases Resolution (frankly implausible) ensure that training set is like test set Resolution (perhaps) i.e. build models try only to learn things that are fairly represented in datasets

Representation is a response to bias Attributes Semantic parts Tying Example Ramanan s activity example where you are often reveals what you are doing but how do we encode where you are x-y coords? near the stove?

Conclusion Q: What do big datasets tell us? A: Not much, if the emphasis is on size Collecting datasets is highly creative rather than a nuisance activity tools are getting better by the day Bias, weird frequencies are a major issue There are no best practices for avoiding problems May shape our representations Recognition problems are hard to frame excess certainty may be dangerous

One belief space about recognition Platonism? Categories are fixed and known Each instance belongs to one category of k Object recognition=k-way classification current data sets ok in principle improve coverage collect unbiased datasets with fair coverage research agenda: more features, better classifiers: perhaps category hierarchies for statistical leverage (tying) Obvious nonsense Obvious nonsense I doubt this is possible I doubt this is possible

What have we inherited from this view? Deep pool of information about feature constructions Tremendous skill and experience in building classifiers Much practice at empiricism which is valuable, and hard to do right

Are these monkeys?

Another belief space about recognition Categories are highly fluid opportunistic devices to aid generalization affected by current problem instances can belong to many categories simultaneously at different times, the same instance may belong to different categories categories are shaded much within class variation is principled Most categories are rare Many might be personal, many are negotiated Understanding (recognition) constant coping with the (somewhat) unfamiliar bias is pervasive, affects representation

Visual complexity Some categories hard to detect, others easy? Farhadi + Sadeghi 11

Co-existing category systems Monkey or Plastic toy or both or irrelevant Some of this depends on what you re trying to do, in ways we don t understand Person or child or beer drinker or beer-drinking child or tourist or holidaymaker or obstacle or potential arrest or irrelevant or...

Research agenda How do we build bias-robust representations? What should we mean by category? how are categories created? how can multiple category systems co-exist? how can we sew together categorization and utility? What should we report about pictures? What should we say about things? What kind of clumps of meaning should we detect?