Big Data: A Critical Analysis!!




DAIS - Università Ca' Foscari Venezia. Teresa Scantamburlo. Big Data: A Critical Analysis!! 23rd April 2015, Politecnico di Milano

Outline
The Realm of Big Data: Big Data definitions, the Big Data paradigm, examples (research and applications)
Philosophical assumptions: the empiricist approach, critical aspects, Hume's legacy and mechanized induction
Open problems: models of data vs. models of phenomena, the role of induction in cognitive activity

The Realm of Big Data

Digital Footprints

Internet of Things

The Age of Big Data We are witnessing an exceptional growth of flows of information: we are entering the age of big data. The term big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse (McKinsey Global Institute, 2011). It's a revolution, says Gary King, director of Harvard's Institute for Quantitative Social Science. We're really just getting under way. But the march of quantification, made possible by enormous new sources of data, will sweep through academia, business and government. There is no area that is going to be untouched (New York Times, 2012).

Big Data Innovations 1. We can analyse far more data; in some cases we can process all of it relating to a particular phenomenon (comprehensiveness). 2. Big data is messy, varies in quality, and is distributed among countless servers around the world. With big data we'll be satisfied with a sense of general direction rather than knowing a phenomenon to the inch, the penny, the atom (messiness). 3. In a big data world we don't have to be fixated on causality; we can discover patterns and correlations, which may not tell us why something is happening but alert us that something is happening (correlation). (Mayer-Schönberger and Cukier, 2013)

Characterizing Features VELOCITY: created in or near real-time. VARIETY: structured and unstructured in nature. EXHAUSTIVE IN SCOPE: striving to capture entire populations or systems (n=all). RELATIONAL: containing common fields that enable the conjoining of different data sets. FINE-GRAINED in resolution. FLEXIBLE: holding the traits of extensionality (new fields can be added easily) and scalability (can expand in size rapidly). (R. Kitchin, 2014)

Big Data Paradigm Big data is a socio-technical phenomenon. It refers not only to very large data sets and the tools and procedures used to manipulate and analyse them, but also to a computational turn in thought and research. It is a profound change at the levels of epistemology and ethics. Big data reframes key questions about the constitution of knowledge, the process of research, how we should engage with information, and the nature and the categorisation of reality (d. boyd and K. Crawford, 2012)

Computational X Big data and analytics are fostering the emergence of new signposts, Computational + X, and the development of new research areas: Computational Social Science, Computational Biology, Computational Physics, Computational Chemistry, Computational Economics, Computational Medicine, Computational Law, Computational Linguistics, Digital Humanities, Computer Ethics... This trend can be viewed as a result of what has been called info-computationalism, a framework based on two fundamental concepts: information as a structure (the fabric of the universe) and computation as its dynamics (G. Dodig Crnkovic, 2010)

Big Data Business Conferences (new and old); journals, books, etc.; education (courses, summer schools); research centres; research projects; companies and start-ups

Computational Social Science The main computational social science areas are: automated information extraction systems and social network analysis; social geographic information systems (GIS); complexity modelling; social simulation models. The Wisdom of Crowds If you put together a big enough and diverse group of people and ask them to make decisions affecting matters of general interest, that group's decisions will, over time, be intellectually superior to the isolated individual, no matter how smart or well informed he is (J. Surowiecki, 2004)
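Surowiecki's claim has a simple statistical core: independent errors cancel when guesses are averaged. A toy simulation (all numbers hypothetical) of a crowd estimating an unknown quantity:

```python
import random
import statistics

random.seed(42)

TRUE_VALUE = 100.0  # the quantity the crowd is estimating (hypothetical)

# Each person makes an independent, noisy guess around the true value.
guesses = [random.gauss(TRUE_VALUE, 20.0) for _ in range(1000)]

# The crowd's estimate is the average of all guesses.
crowd_error = abs(statistics.mean(guesses) - TRUE_VALUE)

# A typical individual's error, by contrast, is roughly the noise level.
individual_error = statistics.mean(abs(g - TRUE_VALUE) for g in guesses)

print(f"crowd error:              {crowd_error:.2f}")
print(f"typical individual error: {individual_error:.2f}")
```

With independent guesses the standard error of the average shrinks roughly as the square root of the group size, which is the statistical reading of the group being "intellectually superior to the isolated individual". The effect disappears when guesses are correlated, e.g. when people influence each other.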

Disease Detection By processing hundreds of billions of individual searches from five years of Google web search logs, our system generates more comprehensive models for use in influenza surveillance, with regional and state-level estimates of influenza-like illness (ILI) activity in the United States. J. Ginsberg et al., 2009 (Figure: influenza-like illness (ILI) activity in the United States; red = prediction by the U.S. Centers for Disease Control and Prevention, black = prediction from aggregated historical search logs.)
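The modelling idea behind the Ginsberg et al. system can be sketched in a few lines: fit a linear relationship between the log-odds of the flu-related query fraction and the log-odds of the ILI percentage, then use it to turn new query volumes into ILI estimates. The weekly figures below are invented for illustration; the real system drew on hundreds of billions of searches.

```python
import math

# Toy weekly data (hypothetical): fraction of searches that are
# flu-related, and the CDC-reported ILI percentage for the same week.
query_fraction = [0.010, 0.012, 0.018, 0.025, 0.030, 0.022, 0.015]
ili_percent    = [1.1,   1.3,   2.0,   2.9,   3.4,   2.5,   1.6]

def logit(p):
    return math.log(p / (1.0 - p))

# Fit logit(ILI) as a linear function of logit(query fraction)
# by ordinary least squares.
xs = [logit(q) for q in query_fraction]
ys = [logit(p / 100.0) for p in ili_percent]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
beta1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
beta0 = my - beta1 * mx

def predict_ili(q):
    """Predict the ILI percentage from a flu-related query fraction."""
    z = beta0 + beta1 * logit(q)
    return 100.0 / (1.0 + math.exp(-z))

print(f"predicted ILI at query fraction 0.020: {predict_ili(0.020):.2f}%")
```

The point of the sketch is the shape of the pipeline, historical logs in, a fitted surveillance model out, not the toy coefficients.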

Mass Media Analysis The contents of English-language online news over 5 years have been analysed to explore the impact of the Fukushima disaster on the media coverage of nuclear power. This big data study, based on millions of news articles, involves the extraction of narrative networks, association networks, and sentiment time series. The key finding is that media attitude towards nuclear power has significantly changed in the wake of the Fukushima disaster. T. Lansdall-Welfare et al., 2014 (Figure: association networks before and after the disaster.)

Recruiting system Some companies are using big data to recruit new employees or to predict which employees are likely to flourish or fail. With data mining techniques we could, e.g.: estimate the specific numerical value of sales; predict production time or tenure period; rank employees. For example, Applicant Tracking System (ATS) software can score and sort resumes and other job application materials from a central database and rank applicants in order to achieve the best fit between a job opening and available job candidates (Data & Society Research Institute, 2014). BetterWorks (a company in Palo Alto) makes office software that blends aspects of social media, fitness tracking and video games into a system meant to keep employees more engaged with their work and one another (New York Times, 2015)
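A naive sketch of how an ATS might "score and sort resumes": count the overlap between a job posting's keywords and each resume. The names, keywords and resume texts below are invented; production systems use far richer parsing and models.

```python
# Keywords extracted from a hypothetical job posting.
JOB_KEYWORDS = {"python", "statistics", "sql", "communication"}

# Invented resume snippets keyed by applicant name.
resumes = {
    "alice": "Experienced in Python and SQL, strong statistics background",
    "bob":   "Great communication skills, some Python",
    "carol": "Marketing specialist, social media",
}

def score(text):
    """Score a resume by how many job keywords it mentions."""
    words = {w.strip(",.").lower() for w in text.split()}
    return len(JOB_KEYWORDS & words)

# Rank applicants by descending keyword overlap.
ranking = sorted(resumes, key=lambda name: score(resumes[name]), reverse=True)
print(ranking)  # → ['alice', 'bob', 'carol']
```

Even this crude ranker illustrates the concern raised later in the talk: whoever chooses the keywords and the scoring rule is quietly defining what "best fit" means.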

Crime Fighting The Chicago Police Department conducted a research project that looked at data collected by the police department to see if Big Data analytics could be applied in police work. We could in fact leverage data science across police administrative data and use it as a framework to use predictive data to prevent violence. (http://www.datacenterdynamics.com/) London's Metropolitan Police Service is using new software which pulls large amounts of data in use by the police service and puts it through an advanced analytics engine to predict when criminals are likely to strike. By analysing five years' worth of data, it is hoped that an accurate prediction of when / if a criminal will re-offend can be made. (http://www.cloudcomputing-news.net)

Philosophical Assumptions

The End of Theory This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behaviour, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves. (C. Anderson, 2008) Scientists no longer have to make educated guesses, construct hypotheses and models, and test them with data-based experiments and examples. Instead, they can mine the complete set of data for patterns that reveal effects, producing scientific conclusions without further experimentation. (M. Prensky, 2009)

The Effectiveness of Data We should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data. The biggest successes in natural-language-related machine learning have been statistical speech recognition and statistical machine translation. The reason for these successes is not that these tasks are easier than other tasks...the reason is that a large training set of the input-output behaviour that we seek to automate is available to us in the wild. (A. Halevy, P. Norvig and F. Pereira, 2009)

The Triumph of Correlations There is now a better way. Petabytes allow us to say: Correlation is enough. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot... Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all. (C. Anderson, 2008) The correlations may not tell us precisely why something is happening, but they alert us that it is happening. And in many situations this is good enough. (V. Mayer-Schönberger and K. Cukier, 2013)
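One reason the "correlation is enough" stance deserves scrutiny: if an algorithm scans enough candidate variables against few observations, strong correlations surface even in pure noise. A small illustration with hypothetical random data:

```python
import math
import random

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A target series of 10 observations, and 1,000 candidate variables,
# all pure random noise with no causal link to the target.
target = [random.gauss(0, 1) for _ in range(10)]
candidates = [[random.gauss(0, 1) for _ in range(10)] for _ in range(1000)]

# "Agnostic" pattern mining: keep whichever candidate correlates best.
best_r = max(abs(pearson(c, target)) for c in candidates)
print(f"strongest correlation found in pure noise: {best_r:.2f}")
```

The mined correlation looks impressive, yet by construction it can alert us to nothing: there is no underlying phenomenon. Letting the numbers "speak for themselves" without correcting for how many questions were asked is exactly the kind of framing the critical sections of this talk examine.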

Empiricism Reborn In summary, the main tenets of the empiricist approach to big data are: big data can capture a whole domain and provide full resolution; there is no need for a priori theory, models or hypotheses; through the application of agnostic data analytics the data can speak for themselves, free of human bias or framing, and any patterns and relationships within big data are inherently meaningful and truthful; meaning transcends context or domain-specific knowledge, and thus can be interpreted by anyone who can decode a statistic or data visualization. (R. Kitchin, 2014)

Empiricism & Hume's Legacy The debate between rationalism and empiricism. Rationalists: concepts and knowledge are gained independently of sense experience. Empiricists: sense experience is the ultimate source of all our concepts and knowledge. Hume's view of knowledge: it arises in the mind spontaneously and naturally, without the involvement of reason, merely because the mind is acted upon by the same objects in the same way repeatedly

Alternative Approaches There are alternative approaches to empiricism. They view big data and analytics as a positive contribution to scientific practice without considering them an oracle or a conclusive solution. Data-driven science is a hybrid combination of abductive, inductive and deductive approaches to advance the understanding of a phenomenon. It forms a new mode of hypothesis generation before a deductive approach is employed. The epistemological strategy adopted within data-driven science is to use guided knowledge discovery techniques to identify potential questions (hypotheses) worthy of further examination and testing (R. Kitchin, 2014)

Objective Science? In reality, working with Big Data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth (e.g., consider social media) (d. boyd and K. Crawford, 2012) Big data is not self-explanatory. And yet the specific methodologies for interpreting the data are open to all sorts of philosophical debate. Can the data represent an objective truth, or is any interpretation necessarily biased by some subjective filter or the way that data is cleaned? (Bollier, 2010) Critical aspects on objectivity and accuracy: biases and subjective choices; large data sets and data errors; knowing the weaknesses in the data

Quality vs. Quantity? Big data offers the humanistic disciplines a new way to claim the status of quantitative science and objective method. Big data may support the mistaken belief that qualitative researchers are in the business of interpreting stories and quantitative researchers are in the business of producing facts Big data risks re-inscribing established divisions in the long running debates about scientific method and the legitimacy of social science and humanistic inquiry. (d. boyd and K. Crawford, 2012)

Data Out of Context Because large data sets can be modelled, data are often reduced to what can fit into a mathematical model. Yet, taken out of context, data lose meaning and value. The rise of social network sites prompted an industry-driven obsession with the social graph. (d. boyd and K. Crawford, 2012) Critical aspects on data and contextual information: social graphs are not equivalent to personal networks (e.g. consider the notion of tie strength); not every connection is equivalent to every other connection; conveyed information may change over the network

Ethical implications Being in public is not the same as being public: it is problematic for researchers to justify their actions as ethical simply because the data are accessible. Just because content is publicly accessible does not mean that it was meant to be consumed by just anyone (problem of accountability and informed consent). Limited access to big data creates a new digital divide. Some companies restrict access to their data entirely; others sell the privilege of access for a fee; and others offer small data sets to university-based researchers... the current ecosystem around big data creates a new kind of digital divide: the big data rich and the big data poor. (d. boyd and K. Crawford, 2012)

Is Big Data Unfair? As we're on the cusp of using machine learning for rendering basically all kinds of consequential decisions about human beings in domains such as education, employment, advertising, health care and policing, it is important to understand why machine learning is not, by default, fair or just in any meaningful way. This runs counter to the widespread misbelief that algorithmic decisions tend to be fair, because, you know, math is about equations and not skin colour. (M. Hardt, 2014) After all, as the former CPD [Chicago Police Department] computer experts point out, the algorithms in themselves are neutral. This program had absolutely nothing to do with race but multi-variable equations. Meanwhile, the potential benefits of predictive policing are profound. (Gillian Tett, financial reporter)

Discriminatory impact Inequalities might be conveyed in various ways, and potential harms are directly concerned with the inner structure of algorithmic decision procedures. Big-data-driven decision making could have discriminatory effects even in the absence of discriminatory intent. Further concerns are expressed about an opaque decision-making environment and an impenetrable set of algorithms. Approached without care, data mining can reproduce existing patterns of discrimination, inherit the prejudice of prior decision-makers, or simply reflect the widespread biases that persist in society. It can even have the perverse result of exacerbating existing inequalities. (S. Barocas and A.D. Selbst, 2014)

How Discrimination Occurs Machine learning and data mining represent a form of statistical discrimination: basically they aim to end up with classifications/groupings which make sense. In machine learning procedures there are several mechanisms/steps which can play a role in the production of discriminatory results: defining the target variable and class labels; training data; feature selection; proxies; masking (S. Barocas and A.D. Selbst, 2014)

Machine Learning The field of machine learning studies how a machine/computer can learn specific tasks by following specified learning algorithms. As opposed to artificial intelligence, it does not try to explain or generate intelligent behaviour; its goal is to discover mechanisms by which very specific tasks can be learned by a computer (inductive inference and generalization ability). Statistical Learning Theory Framework The machine is shown particular examples (x_i, y_i) of a specific task, where the x_i are instances and the y_i are labels. Its goal is to infer a general rule (classifier) which can both explain the examples it has seen already and generalize to new examples. (von Luxburg and Schölkopf, 2011)
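The framework above, inferring a rule from labelled examples (x_i, y_i) that generalizes to unseen instances, can be illustrated with a nearest-centroid classifier on toy 2-D data (a deliberately minimal learning algorithm chosen for brevity, not one named on the slide):

```python
import math

# Labelled examples (x_i, y_i): 2-D instances with binary labels (toy data).
examples = [((1.0, 1.2), 0), ((0.8, 0.9), 0), ((1.1, 0.7), 0),
            ((3.0, 3.1), 1), ((2.8, 3.3), 1), ((3.2, 2.9), 1)]

def fit(examples):
    """Inductive step: compress the examples into one centroid per class."""
    sums, counts = {}, {}
    for x, y in examples:
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in s) for y, s in sums.items()}

def predict(centroids, x):
    """The inferred general rule: assign x to the nearest class centroid."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))

centroids = fit(examples)
print(predict(centroids, (1.0, 1.0)))  # unseen instance → 0
print(predict(centroids, (3.0, 3.0)))  # unseen instance → 1
```

The rule both explains the training examples and classifies points it has never seen, which is exactly the generalization ability the slide describes.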

Statistical Learning Theory

Defining Target Variable The proper specification of the target variable is not always obvious. In some problems, defining the outcome of interest can be difficult. There are different degrees of difficulty: spam detection (a simple binary classification); credit scoring ("creditworthy" is a more problematic category); employment decisions (the definition of a good employee is not given). General lesson: while critics of data mining have tended to focus on inaccurate classifications (false positives and false negatives), as much if not more danger resides in the definition of the class label itself and the subsequent labelling of examples from which rules are inferred (S. Barocas and A.D. Selbst, 2014)

Training Data Discriminatory training data leads to discriminatory models. This may happen in two ways. Labelling examples: the analyst introduces biases and prejudices in the choice of examples (the classifier will reproduce the prejudices embedded in the examples); prior prejudice can also be inherited from the ongoing behaviour of users taken as input to data mining. Data collection: disadvantaged groups are less involved in the formal economy and its data-generating activities, because they have unequal access to and relatively less fluency in the technology necessary to engage online, or because they are less profitable customers or important constituents and therefore less interesting as targets of observation (S. Barocas and A.D. Selbst, 2014)
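Both mechanisms can be made concrete with a few lines of invented data. Here the protected attribute is never used as a feature, yet the prejudiced labels of a past decision-maker, combined with a proxy feature ("postcode", invented for this sketch), are enough for the learned rule to reproduce the discrimination:

```python
# Toy hiring data (hypothetical). The "hired" label reflects a prejudiced
# past decision-maker: equally qualified candidates from group B were
# rejected. "postcode" happens to correlate perfectly with group here.
past_decisions = [
    {"score": 8, "postcode": "north", "hired": True},   # group A
    {"score": 8, "postcode": "north", "hired": True},   # group A
    {"score": 8, "postcode": "south", "hired": False},  # group B, same score
    {"score": 8, "postcode": "south", "hired": False},  # group B, same score
]

def predict(applicant):
    """Naive learned rule: majority vote among past candidates matching on
    every available feature. Group membership is NOT a feature."""
    matches = [d["hired"] for d in past_decisions
               if d["score"] == applicant["score"]
               and d["postcode"] == applicant["postcode"]]
    return sum(matches) > len(matches) / 2

# Two new applicants with identical scores, differing only in the proxy.
print(predict({"score": 8, "postcode": "north"}))  # True  — hired
print(predict({"score": 8, "postcode": "south"}))  # False — rejected
```

No discriminatory intent appears anywhere in the code; the disparate outcome is carried entirely by the labels and the proxy, which is the point Barocas and Selbst make.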

FATML at NIPS and ICML FAT ML = Fairness, Accountability and Transparency in Machine Learning. Presented at NIPS 2014 and ICML 2015. Organizers: S. Barocas, S. Friedler, M. Hardt, J. Kroll, S. Venkatasubramanian, H. Wallach. http://www.fatml.org/

Open Problems

The Rationale of Data Science The development of data science poses several questions about the meaning and the role of inductive inference in research activities and decision making. Some open problems concern: data science and the philosophical accounts of induction (Hume's legacy and different perspectives); the role of inductive inference in models of data (abstraction) and in models of phenomena (generalization); models of data in scientific practice and other human activities (e.g. practical reasoning)

Models of Data Data analysis models Beyond the goal of accurate prediction, the scientific insight that computational data models give in a specific case may be limited. Data analysis techniques are not specific to the type of data that are modelled; the techniques are designed to be independent of specific applications: they are application-neutral. Theoretical scientific models A theoretical scientific model is, in contrast, specific to a type of phenomenon. The theoretical concepts and laws that give shape to the theoretical model are chosen on the basis of the physical properties of the phenomenon to be modelled. (D.M. Bailer-Jones and C.A.L. Bailer-Jones, 2002)

Models of Data (Figure: schematic from D.M. Bailer-Jones and C.A.L. Bailer-Jones, 2002)

References
C. Anderson, The End of Theory: The Data Deluge Makes the Scientific Method Obsolete, 2008
S. Barocas and A.D. Selbst, Big Data's Disparate Impact, 2014
D.M. Bailer-Jones and C.A.L. Bailer-Jones, Modelling Data: Analogies in Neural Networks, Simulated Annealing and Genetic Algorithms, 2002
D. Bollier, The Promise and the Peril of Big Data, 2010
d. boyd and K. Crawford, Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon, 2012
G. Dodig Crnkovic, Biological Information and Natural Computation, 2010
S. Leonelli, What Difference Does Quantity Make? On the Epistemology of Big Data in Biology, 2014
R. Kitchin, Big Data, New Epistemologies and Paradigm Shifts, 2014
A. Halevy, P. Norvig and F. Pereira, The Unreasonable Effectiveness of Data, 2009
V. Mayer-Schönberger and K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think, 2013
M. Hardt, How Big Data Is Unfair: Understanding Sources of Unfairness in Data Driven Decision Making, 2014

Thanks!