T-61.6010 Non-discriminatory Machine Learning

T-61.6010 Non-discriminatory Machine Learning, Seminar 1. Indrė Žliobaitė, Aalto University School of Science, Department of Computer Science; Helsinki Institute for Information Technology (HIIT); University of Helsinki. 16 September 2015.

Public attention

Digital universe: communication data; mobility data and traffic; transactional data; multimedia; Internet of things. In 2013, two thirds of the bits in the digital universe were created or captured by consumers and workers; enterprises had liability or responsibility for 85% of the digital universe (IDC survey 2014 for EMC). In 2013, 22% of the digital universe was a candidate for analysis and 5% was actually analyzed; by 2020, 35% will be a candidate for analysis.

Obama report 2014 on big data: decisions informed by big data could have discriminatory effects even in the absence of discriminatory intent. Policy recommendation: expand technical expertise to stop discrimination.

Policy making. Finland: a new non-discrimination act came into force in January 2015; it expands the scope of protection to public and private activities, and covers ethnic origin, age, nationality, language, religion, belief, opinion, health, disability, sexual orientation or other personal characteristics. EU: a new non-discrimination directive is in preparation; it expands discrimination grounds and areas, and widens definitions. Obama report 2014 on big data: decisions informed by big data could have discriminatory effects even in the absence of discriminatory intent. Increasing attention to digital discrimination: fairness, transparency and accountability in machine learning; workshops, projects, public statements.

Why care? Human decision makers may discriminate occasionally; algorithms would discriminate systematically and continuously. Algorithms are often considered to be inherently objective, but models are only as good as the data and the modeling assumptions. Algorithms may capture human biases, and may exaggerate them. Why care? To protect vulnerable people? Because the law requires it? As computer scientists we are held accountable for algorithm performance, and need to be able to control and explain what is happening.

Research attention

book journal special issue 2014 22(2)

Research background

AirBnB case. A Harvard study found that non-black hosts charge approximately 12% more than black hosts for an equivalent rental in New York City. This is out of the company's control: the crowd discriminates. But what if AirBnB learns a price recommender on this data?

Other examples. Big data: personalized pricing, recommendations, personalized ads; CV screening, salary estimation; personalized medicine; navigation and route planning; learning support (education), sports and wellbeing. More traditional applications: credit scoring, insurance; spam filtering; crime prediction, profiling; university acceptance, funding decisions.

Machine learning and discrimination. Discrimination: inferior treatment based on ascribed group rather than individual merits. Machine learning: enforcing constraints defined by legislation, not judging what is morally wrong or right. Direct vs. indirect discrimination: twins test; redlining.

Can algorithms discriminate? Algorithms can discriminate when: data is incorrect due to discriminatory decisions in the past; the population is changing over time; data is incomplete (omitted variable bias); there is sampling bias. This is typically indirect discrimination. Algorithms can discriminate even when the protected characteristic is not part of the equations.

Machine learning and discrimination. Discrimination: inferior treatment based on ascribed group rather than individual merits. Predictive models: $y = f(X)$. [Diagram: inputs X, protected characteristic s, and target y (polarized).]

Source: "Home Owners' Loan Corporation Philadelphia redlining map". Licensed under Public Domain via Wikipedia

Solutions?

Removing protected characteristic: $y = f(X, s) \rightarrow y = f(X)$

Sneetches Dr. Seuss, 1961 http://www.stevehackman.net/wp-content/uploads/2013/02/sneetches.jpg

Removing the protected characteristic does not solve the problem if s is correlated with X. Desired: $y = f(X)$. What happens: $y = f(X, s^*)$, where $s^* = f(X)$.

[Diagram: three dependency configurations among X, s and y, labelled "No problem", "Problem!" and "No problem".]
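The proxy effect can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not taken from the slides: the neighbourhood feature, the ten records, and the majority-vote "model". The model never sees s, yet its predictions still differ sharply between the two groups, because the neighbourhood stands in for s.

```python
from collections import Counter, defaultdict

# Hypothetical records: (neighbourhood X, protected characteristic s, historical decision y).
# Neighbourhood "A" is mostly s=0, "B" is mostly s=1, and past decisions favoured s=0.
data = [("A", 0, 1)] * 4 + [("A", 1, 0)] + [("B", 1, 0)] * 4 + [("B", 0, 1)]


def fit_majority(rows):
    """Learn the majority decision per neighbourhood; s is deliberately ignored."""
    counts = defaultdict(Counter)
    for x, _s, y in rows:
        counts[x][y] += 1
    return {x: c.most_common(1)[0][0] for x, c in counts.items()}


def positive_rate(preds, rows, group):
    """Share of positive predictions among records with s == group."""
    chosen = [p for p, (_x, s, _y) in zip(preds, rows) if s == group]
    return sum(chosen) / len(chosen)


model = fit_majority(data)
predictions = [model[x] for x, _s, _y in data]

rate_s = positive_rate(predictions, data, 1)
rate_not_s = positive_rate(predictions, data, 0)
print(model, rate_s, rate_not_s)
```

In this toy data the model effectively reconstructs s from X, which is the "what happens" case on the slide.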

Kamiran et al 2010

Computational discrimination research: discovery vs. prevention. Discovery: by the statistics and economics communities since the 80s, typically in mortgage lending, insurance or job admission/lay-offs; typically based on regression (look at coefficients) or statistical hypothesis testing (equality of means). Prevention: by the machine learning/data mining community; a relatively new topic, since 2008-2009; mostly focused on classification so far; a lot of challenges ahead.

Measuring: a very open topic! Basic measure: $D = p(+ \mid s) - p(+ \mid \text{not } s)$, or $p(+ \mid s) / p(+ \mid \text{not } s)$, or $p(+ \mid s) / p(+)$, ... Taking into account explanatory features is easy to measure for discovery (hypothesis testing) but difficult for prevention: $p(+ \mid X, s) - p(+ \mid X, \text{not } s)$. Taking into account different decision thresholds: normalized measures. See Romei and Ruggieri 2014, Mancuhan and Clifton 2014, Žliobaitė 2015.
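The basic measures can be sketched in a few lines. This is a minimal illustration that assumes binary decisions (1 for "+") and a binary group indicator where 1 stands for "s" and 0 for "not s". The function names and the eight example records are hypothetical.

```python
def positive_rate(labels, groups, group):
    """Estimate p(+ | group) as the share of positive labels within that group."""
    selected = [y for y, s in zip(labels, groups) if s == group]
    return sum(selected) / len(selected)


def basic_measures(labels, groups):
    """Return the difference and the two ratio measures for s (1) versus not s (0)."""
    p_s = positive_rate(labels, groups, 1)        # p(+ | s)
    p_not_s = positive_rate(labels, groups, 0)    # p(+ | not s)
    p_all = sum(labels) / len(labels)             # p(+)
    return {
        "difference": p_s - p_not_s,
        "ratio": p_s / p_not_s,
        "ratio_to_overall": p_s / p_all,
    }


# Hypothetical decisions: 1 of 4 positives in group s, 3 of 4 in group not s.
labels = [1, 0, 0, 0, 1, 1, 1, 0]
groups = [1, 1, 1, 1, 0, 0, 0, 0]
measures = basic_measures(labels, groups)
print(measures)
```

The ratio variants are undefined when the denominator group has no positive decisions, so a real implementation would need to handle that case.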

Prevention solutions. Preprocessing: modify input data (X, s or y); resample input data. Regularization. Postprocessing: modify models; modify outputs.

Modify input data. Modify y: massaging (Kamiran and Calders 2009).
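A rough sketch of label massaging follows. It is a simplification under stated assumptions, not the authors' implementation: each record is assumed to carry a score from some ranker (the scores here are made up), s=1 is assumed to be the deprived group, and labels are swapped in pairs until the gap in positive rates closes. Negatives from the deprived group with the highest scores are promoted, and positives from the favoured group with the lowest scores are demoted, so the overall number of positives stays the same.

```python
def rate(rows, group):
    """Share of positive labels among records of the given group."""
    ys = [r["y"] for r in rows if r["s"] == group]
    return sum(ys) / len(ys)


def massage(rows, deprived=1):
    """Swap labels pairwise near the decision boundary until the rate gap closes."""
    rows = [dict(r) for r in rows]  # work on copies, keep the input intact
    favoured = 1 - deprived
    # Promotion candidates: deprived negatives, highest score first.
    promote = sorted((r for r in rows if r["s"] == deprived and r["y"] == 0),
                     key=lambda r: -r["score"])
    # Demotion candidates: favoured positives, lowest score first.
    demote = sorted((r for r in rows if r["s"] == favoured and r["y"] == 1),
                    key=lambda r: r["score"])
    for up, down in zip(promote, demote):
        if rate(rows, favoured) - rate(rows, deprived) <= 0:
            break
        up["y"], down["y"] = 1, 0
    return rows


# Hypothetical training data with ranker scores.
train = [
    {"s": 1, "y": 1, "score": 0.9}, {"s": 1, "y": 0, "score": 0.6},
    {"s": 1, "y": 0, "score": 0.4}, {"s": 1, "y": 0, "score": 0.2},
    {"s": 0, "y": 1, "score": 0.8}, {"s": 0, "y": 1, "score": 0.7},
    {"s": 0, "y": 1, "score": 0.55}, {"s": 0, "y": 0, "score": 0.3},
]
massaged = massage(train)
print(rate(massaged, 1), rate(massaged, 0))
```

With unequal group sizes the last swap can overshoot slightly, so the gap may end up small rather than exactly zero.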

Modify input data. Modify X (massaging): any attributes in X that could be used to predict s are changed such that a fairness constraint is satisfied. The approach is similar to sanitizing datasets for privacy preservation (Feldman et al 2014).

Resample: preferential sampling (Kamiran and Calders 2010).
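The resampling idea can be sketched as follows. This is a simplified illustration under several assumptions, not the published procedure: the target size of each (s, y) group is the count expected if s and y were independent, records carry made-up ranker scores, and "borderline" is taken to mean a score close to 0.5. Groups that are too large lose their most borderline records, and groups that are too small get their most borderline records duplicated.

```python
from collections import Counter


def target_sizes(rows):
    """Expected size of each (s, y) group if s and y were independent."""
    n = len(rows)
    by_s = Counter(s for s, _y, _score in rows)
    by_y = Counter(y for _s, y, _score in rows)
    return {(s, y): round(by_s[s] * by_y[y] / n) for s in by_s for y in by_y}


def preferential_sample(rows):
    """Resize every (s, y) group to its target, touching borderline records first."""
    sample = []
    for (s, y), target in target_sizes(rows).items():
        # Most borderline records (score nearest 0.5) come first.
        group = sorted((r for r in rows if r[0] == s and r[1] == y),
                       key=lambda r: abs(r[2] - 0.5))
        if not group:
            continue  # nothing to duplicate from
        if len(group) >= target:
            # Too many: drop the most borderline, keep the rest.
            sample.extend(group[len(group) - target:])
        else:
            # Too few: keep all and duplicate the most borderline cyclically.
            sample.extend(group)
            sample.extend(group[i % len(group)] for i in range(target - len(group)))
    return sample


# Hypothetical records: (s, y, ranker score).
rows = [
    (1, 1, 0.55), (1, 0, 0.45), (1, 0, 0.30), (1, 0, 0.10),
    (0, 1, 0.90), (0, 1, 0.70), (0, 1, 0.52), (0, 0, 0.48),
]
sample = preferential_sample(rows)
counts = Counter((s, y) for s, y, _score in sample)
print(counts)
```

Rounding the expected counts means the resampled dataset can differ slightly in size from the original.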

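Regularization (decision tree). Regular tree induction evaluates the data subsets produced by a split using entropy with respect to the class label. Non-discriminatory tree induction also considers entropy with respect to the protected characteristic, and scores splits by IGC - IGS: information gain with respect to the class minus information gain with respect to the protected characteristic (Kamiran et al. 2010).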
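The IGC - IGS criterion can be sketched directly from the definitions of entropy and information gain. The helper names and the four-record example are assumptions made for illustration. A split that separates the classes without separating the groups scores high, and a split that only separates the groups scores low.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(values, branches):
    """Entropy reduction in `values` when records are routed to `branches`."""
    n = len(values)
    parts = {}
    for value, branch in zip(values, branches):
        parts.setdefault(branch, []).append(value)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(values) - remainder


def split_score(labels, protected, branches):
    """IGC - IGS: reward gain on the class, penalise gain on the protected characteristic."""
    return information_gain(labels, branches) - information_gain(protected, branches)


# Hypothetical node with four records.
labels = [1, 1, 0, 0]
protected = [1, 0, 1, 0]

good_split = ["left", "left", "right", "right"]   # separates classes, not groups
proxy_split = ["left", "right", "left", "right"]  # separates groups, not classes

good_score = split_score(labels, protected, good_split)
proxy_score = split_score(labels, protected, proxy_split)
print(good_score, proxy_score)
```

A tree learner would evaluate this score for every candidate split at a node and choose the highest one.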

Postprocessing: modifying the model. Relabel tree leaves to remove the most discrimination with the least damage to accuracy (Kamiran et al. 2010).

Prevention solutions. Preprocessing: modify input data (X, s or y); resample input data. Regularization. Postprocessing: modify models; modify outputs. From a legal perspective: decision manipulation is very bad; data manipulation is quite bad; the protected characteristic should not be used in decision making.

Challenges ahead. Impact challenges: What is the scope of potentially discriminatory applications? Businesses are reluctant to collaborate, afraid of negative publicity. The public is not concerned, thinking that algorithms are always objective. Research challenges: defining the right discrimination measures and optimization criteria; translating legal requirements into mathematical constraints and back. Transparency and interpretability of the solutions are critical: stakeholders need to understand and trust the solutions.

Thanks!