T-61.6010 Non-discriminatory Machine Learning, Seminar 1
Indrė Žliobaitė
Aalto University School of Science, Department of Computer Science
Helsinki Institute for Information Technology (HIIT)
University of Helsinki
16 September 2015
Public attention
Digital universe
- Communication data, mobility and traffic data, transactional data, multimedia, the Internet of Things
- In 2013, 2/3 of the bits in the digital universe were created or captured by consumers and workers; enterprises had liability or responsibility for 85% of the digital universe (IDC survey 2014 for EMC)
- In 2013, 22% of the digital universe was a candidate for analysis and 5% was actually analyzed; by 2020, 35% will be a candidate for analysis
Obama report 2014 on big data
- Decisions informed by big data could have discriminatory effects even in the absence of discriminatory intent
- Policy recommendation: expand technical expertise to stop discrimination
Policy making
- Finland: a new non-discrimination act came into force in January 2015
- A new EU non-discrimination directive is in preparation: it expands discrimination grounds and areas, and widens definitions
- The scope of protection is expanding: public and private activities; ethnic origin, age, nationality, language, religion, belief, opinion, health, disability, sexual orientation or other personal characteristics
- Obama report 2014 on big data: decisions informed by big data could have discriminatory effects even in the absence of discriminatory intent
- Increasing attention to digital discrimination: fairness, transparency and accountability in machine learning workshops, projects, public statements
Why care?
- Human decision makers may discriminate occasionally; algorithms would discriminate systematically and continuously
- Algorithms are often considered to be inherently objective, but models are only as good as their data and modeling assumptions
- Algorithms may capture human biases, and may exaggerate them
- Why care? To protect vulnerable people? Because the law requires it?
- As computer scientists we are held accountable for algorithm performance, and need to be able to control and explain what is happening
Research attention
Book; journal special issue 2014, 22(2)
Research background
AirBnB case
AirBnB case
- Harvard study: non-black hosts charge approximately 12% more than black hosts for the equivalent rental in New York City
- Out of the company's control: the crowd discriminates
- What if AirBnB learns a price recommender on this data?
Other examples
- Big data: personalized pricing, recommendations, personalized ads; CV screening, salary estimation; personalized medicine; navigation and route planning; learning support (education), sports and well-being
- More traditional applications: credit scoring, insurance; spam filtering; crime prediction, profiling; university acceptance, funding decisions
Machine learning and discrimination
- Discrimination: inferior treatment based on an ascribed group rather than individual merits
- Machine learning: enforcing constraints defined by legislation, not judging what is morally wrong or right
- Direct vs. indirect discrimination: the twins test; redlining
Can algorithms discriminate?
- Algorithms can discriminate when: the data is incorrect due to discriminatory decisions in the past; the population is changing over time; the data is incomplete (omitted variable bias); there is sampling bias
- Typically indirect discrimination: algorithms can discriminate even when the protected characteristic is not part of the equations
Machine learning and discrimination
- Discrimination: inferior treatment based on an ascribed group rather than individual merits
- [Diagram: predictive model y = f(X), with protected characteristic s and polarized outcomes y]
Source: "Home Owners' Loan Corporation Philadelphia redlining map". Licensed under Public Domain via Wikipedia
Solutions?
Removing the protected characteristic: y = f(X, s) → y = f(X)
Sneetches (Dr. Seuss, 1961): http://www.stevehackman.net/wp-content/uploads/2013/02/sneetches.jpg
Removing the protected characteristic does not solve the problem if s is correlated with X
- Desired: y = f(X)
- What happens: y = f(X, s*), where s* = f(X) (see the sketch below)
- [Diagrams: three dependence structures between X, s and y: no problem; problem!; no problem]
Kamiran et al 2010
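A small synthetic sketch of this point (my own illustration, not from the cited sources): when historical decisions depended on s and a remaining feature is correlated with s, a model trained without s still reproduces the group difference in its decisions.

```python
# Dropping s does not help when a remaining feature is correlated with s.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
s = rng.integers(0, 2, n)                        # protected characteristic
x = rng.normal(loc=s, scale=1.0, size=n)         # legitimate-looking feature, correlated with s
# historical decisions were made partly on s itself: biased labels
y = (x + 1.5 * s + rng.normal(size=n) > 1.5).astype(int)

model = LogisticRegression().fit(x.reshape(-1, 1), y)   # s is NOT given to the model
pred = model.predict(x.reshape(-1, 1))
print("p(+|s=1) - p(+|s=0) =", pred[s == 1].mean() - pred[s == 0].mean())  # clearly > 0
```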
Computational discrimination research: discovery vs. prevention
- Discovery: by the statistics and economics communities since the 80s; typically in mortgage lending, insurance, or job admission/lay-offs; typically based on regression (look at the coefficients) or statistical hypothesis testing (equality of means)
- Prevention: by the machine learning / data mining community; a relatively new topic, since 2008-2009; mostly focused on classification so far; a lot of challenges ahead
Measuring
- A very open topic!
- Basic measures (computation sketch below): D = p(+|s) - p(+|not s), or p(+|s) / p(+|not s), or p(+|s) / p(+), ...
- Taking into account explanatory features: p(+|X, s) - p(+|X, not s); easy to measure for discovery (hypothesis testing), difficult for prevention
- Taking into account different decision thresholds: normalized measures
- Romei and Ruggieri 2014, Mancuhan and Clifton 2014, Žliobaitė 2015
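A minimal sketch of computing the basic difference measure above, assuming binary decisions y (1 = positive) and binary group membership s (1 = protected group); the function name and toy data are illustrative only.

```python
import numpy as np

def discrimination_difference(y, s):
    """D = p(+|s=1) - p(+|s=0) for binary decisions y and group membership s."""
    y, s = np.asarray(y), np.asarray(s)
    p_pos_protected = y[s == 1].mean()   # positive-decision rate in the protected group
    p_pos_rest = y[s == 0].mean()        # positive-decision rate in the remaining group
    return p_pos_protected - p_pos_rest

# Toy example: 40% acceptance for s=1 vs. 60% for s=0, so D = -0.2
y = np.array([1, 1, 1, 0, 0,  1, 1, 0, 0, 0])
s = np.array([0, 0, 0, 0, 0,  1, 1, 1, 1, 1])
print(discrimination_difference(y, s))   # -0.2
```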
Prevention solutions
- Preprocessing: modify input data (X, s or y); resample input data
- Regularization
- Postprocessing: modify models; modify outputs
Modify input data
- Modify y: "massaging" (Kamiran and Calders 2009)
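A rough sketch of massaging in the spirit of Kamiran and Calders (2009): rank training examples with an auxiliary model, flip just enough borderline labels so that the two groups end up with equal positive rates, then train the final model on the modified labels. The logistic-regression ranker and the helper name are assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def massage_labels(X, y, s):
    """Flip borderline labels so that both groups get equal positive rates.
    X, y, s are numpy arrays; y: binary labels (1 = positive); s: 1 = protected group."""
    scores = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    prot, unprot = (s == 1), (s == 0)
    disc = y[unprot].mean() - y[prot].mean()                   # discrimination to remove
    m = int(round(disc * prot.sum() * unprot.sum() / len(y)))  # flips needed per group
    y_new = y.copy()
    if m > 0:
        # promote the m highest-scoring negative examples of the protected group
        cand = np.where(prot & (y == 0))[0]
        y_new[cand[np.argsort(scores[cand])[::-1][:m]]] = 1
        # demote the m lowest-scoring positive examples of the unprotected group
        cand = np.where(unprot & (y == 1))[0]
        y_new[cand[np.argsort(scores[cand])[:m]]] = 0
    return y_new
```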
Modify input data
- Modify X: any attributes in X that could be used to predict s are changed such that a fairness constraint is satisfied (Feldman et al 2014)
- The approach is similar to sanitizing datasets for privacy preservation
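A rough sketch of "repairing" a single numeric attribute so that it no longer predicts s, loosely in the spirit of Feldman et al (2014): each value is replaced by the value at the same within-group rank in a group-blind target distribution. Using the pooled distribution as the target is a simplification and an assumption here.

```python
import numpy as np

def repair_attribute(x, s):
    """Make numeric attribute x (numpy array) uninformative about group s."""
    x_repaired = x.astype(float)
    pooled = np.sort(x)                                    # group-blind target distribution
    for group in np.unique(s):
        idx = np.where(s == group)[0]
        ranks = np.argsort(np.argsort(x[idx]))             # within-group ranks 0..k-1
        quantiles = (ranks + 0.5) / len(idx)               # rank -> quantile in (0, 1)
        x_repaired[idx] = np.quantile(pooled, quantiles)   # same rank, pooled value
    return x_repaired
```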
Resample: preferential sampling (Kamiran and Calders 2010)
Regularization: non-discriminatory tree induction (Kamiran et al 2010)
- Regular tree induction: on each data subset produced by a split, choose splits by entropy w.r.t. the class label (information gain IGC)
- Non-discriminatory tree: also measure entropy w.r.t. the protected characteristic; choose splits by IGC - IGS
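A minimal sketch of the split criterion named on this slide: information gain with respect to the class label (IGC) minus information gain with respect to the protected characteristic (IGS). Function names are illustrative, not from Kamiran et al (2010).

```python
import numpy as np

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(target, left_mask):
    """Entropy reduction of splitting `target` into left/right by a boolean mask."""
    n, n_left = len(target), left_mask.sum()
    children = (n_left / n) * entropy(target[left_mask]) \
             + ((n - n_left) / n) * entropy(target[~left_mask])
    return entropy(target) - children

def split_score(y, s, left_mask):
    igc = information_gain(y, left_mask)   # gain w.r.t. the class label
    igs = information_gain(s, left_mask)   # gain w.r.t. the protected characteristic
    return igc - igs                       # IGC - IGS: informative but non-discriminatory splits
```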
Postprocessing: modify the model
- Relabel tree leaves to remove the most discrimination with the least damage to accuracy (Kamiran et al 2010)
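A rough sketch of leaf relabeling in this spirit: given per-leaf statistics, greedily flip the leaves that remove the most discrimination at the least accuracy cost, until discrimination is (approximately) removed. The leaf-statistics format and the greedy ranking are simplifying assumptions.

```python
def relabel_leaves(leaves, n, n_prot, n_unprot, epsilon=0.0):
    """Each leaf: dict with 'label' (0/1 prediction), 'pos'/'neg' (class counts in
    the leaf), 'prot'/'unprot' (group counts). Returns indices of leaves to flip."""
    # current discrimination D = p(+|unprotected) - p(+|protected)
    disc = (sum(l['unprot'] for l in leaves if l['label'] == 1) / n_unprot
            - sum(l['prot'] for l in leaves if l['label'] == 1) / n_prot)
    candidates = []
    for i, l in enumerate(leaves):
        sign = -1 if l['label'] == 1 else 1                       # effect of flipping this leaf
        d_disc = sign * (l['unprot'] / n_unprot - l['prot'] / n_prot)
        d_acc = sign * (l['pos'] - l['neg']) / n
        if d_disc < 0:                                            # flip reduces discrimination
            cost = max(-d_acc, 0.0)                               # accuracy lost by the flip
            candidates.append((cost / -d_disc, i, d_disc))
    to_flip = []
    for _, i, d_disc in sorted(candidates):                       # cheapest flips first
        if disc <= epsilon:
            break
        to_flip.append(i)
        disc += d_disc
    return to_flip
```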
Prevention solutions
- Preprocessing: modify input data (X, s or y); resample input data
- Regularization
- Postprocessing: modify models; modify outputs
- From the legal perspective: decision manipulation is very bad; data manipulation is quite bad; the protected characteristic should not be used in decision making
Challenges ahead
- Impact challenges: What is the scope of potentially discriminatory applications? Businesses are reluctant to collaborate, afraid of negative publicity. The public is not concerned, thinking that algorithms are always objective.
- Research challenges: defining the right discrimination measures and optimization criteria; translating legal requirements into mathematical constraints and back; transparency and interpretability of the solutions is critical, since stakeholders need to understand and trust the solutions.
Thanks!