Big Data and Its Empiricist Foundations
Teresa Scantamburlo
The evolution of Data Science
- The mechanization of induction
- The business of data
- The Big Data paradigm (data + computation)
- Critical analysis
- Tentative solutions (?)
- Open problems
Statistical Learning Theory
The question is how a machine, a computer, can learn from examples (= inductive inference and generalization ability).
The machine is shown particular examples $(x_1, y_1), \dots, (x_n, y_n)$ of a specific task, where $x_i \in X$ (instances) and $y_i \in Y$ (labels). Its goal is to infer a general rule $f : X \to Y$ (classifier) which can both explain the examples it has seen already and generalize to new examples.
von Luxburg and Schölkopf, Statistical Learning Theory: Models, Concepts and Results, 2011
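The learning setup above can be illustrated with a minimal Python sketch. It is not taken from the slides or from von Luxburg and Schölkopf: the one-dimensional instances, the binary labels, the family of threshold rules, and the names `empirical_risk` and `learn_threshold` are all assumptions chosen for illustration. The sketch picks the rule with the lowest error on the examples seen, then checks it on examples it has not seen.

```python
from typing import Callable, List, Tuple

# Hypothetical toy setting: X = real numbers, Y = {0, 1}.
Example = Tuple[float, int]  # (instance x_i, label y_i)


def empirical_risk(f: Callable[[float], int], examples: List[Example]) -> float:
    """Fraction of the given examples that the rule f misclassifies."""
    return sum(f(x) != y for x, y in examples) / len(examples)


def learn_threshold(examples: List[Example]) -> Callable[[float], int]:
    """Choose, among rules f(x) = 1 if x >= t else 0, one with the lowest training error."""
    xs = sorted(x for x, _ in examples)
    # Candidate thresholds: the smallest instance and the midpoints between neighbours.
    candidates = [xs[0]] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best_t = min(
        candidates,
        key=lambda t: empirical_risk(lambda x: int(x >= t), examples),
    )
    return lambda x: int(x >= best_t)


# Examples shown to the machine (made-up data).
train = [(0.5, 0), (1.0, 0), (1.5, 0), (3.0, 1), (3.5, 1), (4.0, 1)]
# New examples used only to check generalization (also made up).
unseen = [(0.8, 0), (1.2, 0), (3.2, 1), (5.0, 1)]

f = learn_threshold(train)
print("training error:", empirical_risk(f, train))
print("error on unseen examples:", empirical_risk(f, unseen))
```

A low training error shows only that the rule explains the examples already seen. Whether it also generalizes depends on how the unseen examples relate to the training ones, which is the question statistical learning theory studies.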
The Business of Data
Big Data is not simply denoted by volume. Some characterizing features:
- velocity, being created in or near real-time;
- variety, being structured and unstructured in nature;
- exhaustive in scope, striving to capture entire populations or systems (n = all);
- relational in nature, containing common fields that enable the conjoining of different data sets;
- fine-grained in resolution;
- flexible, holding the traits of extensionality (can add new fields easily) and scalability (can expand in size rapidly).
R. Kitchin, Big data, new epistemologies and paradigm shifts, 2014
The Big Data Paradigm
Big Data is less about data that is big than it is about a capacity to search, aggregate, and cross-reference large data sets.
Big Data as a socio-technical phenomenon. It rests on the interplay of:
- Technology: maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets.
- Analysis: drawing on large data sets to identify patterns in order to make economic, social, technical, and legal claims.
- Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.
d. boyd and K. Crawford, Critical questions for Big Data: provocations for a cultural, technological, and scholarly phenomenon, 2012
The end of theory
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.
C. Anderson, The end of theory: The data deluge makes the scientific method obsolete, 2008
The triumph of correlations
Big Data encourages a growing respect for correlations, which come to be appreciated as not only a more informative and plausible form of knowledge than the more definite but also more elusive causal explanation.
In the words of Mayer-Schönberger and Cukier (2013): the correlations may not tell us precisely why something is happening, but they alert us that it is happening. And in many situations this is good enough.
S. Leonelli, What Difference Does Quantity Make? On The Epistemology of Big Data in Biology, 2014
Empiricism Reborn
There is a powerful and attractive set of ideas at work in the empiricist epistemology that runs counter to the deductive approach that is hegemonic within modern science:
- Big Data can capture a whole domain and provide full resolution;
- there is no need for a priori theory, models or hypotheses;
- through the application of agnostic data analytics the data can speak for themselves free of human bias or framing, and any patterns and relationships within Big Data are inherently meaningful and truthful;
- meaning transcends context or domain-specific knowledge, thus can be interpreted by [...]
R. Kitchin, Big data, new epistemologies and paradigm shifts, 2014
Some reactions
- Claims to objectivity and accuracy are misleading
- Bigger data are not always better data
- Taken out of context, Big Data loses its meaning
- Just because it is accessible does not make it ethical
- Limited access to Big Data creates new digital divides
d. boyd and K. Crawford, Critical questions for Big Data: provocations for a cultural, technological, and scholarly phenomenon, 2012
An interesting analysis
Both data analysis models and theoretical scientific models are there to solve a problem, one to solve a problem of data analysis, the other to solve a problem of describing an empirical phenomenon.
D.M. Bailer-Jones and C.A.L. Bailer-Jones, Modelling data: Analogies in neural networks, simulated annealing and genetic algorithms, 2002
An interesting analysis
Data analysis models: Beyond the goal of accurate prediction, the scientific insight that computational data models give in a specific case may be limited. Data analysis techniques are not specific to the type of data that are modelled. The techniques are designed to be independent of specific applications: they are application-neutral.
Theoretical scientific models: A theoretical scientific model is, in contrast, specific to a type of phenomenon. The theoretical concepts and laws that give shape to the theoretical model are chosen on the basis of the physical properties of the phenomenon to be modelled.
D.M. Bailer-Jones and C.A.L. Bailer-Jones, Modelling data: Analogies in neural networks, simulated annealing and genetic algorithms, 2002
A tentative reconciliation
In contrast to new forms of empiricism, data-driven science seeks to hold to the tenets of the scientific method, but is more open to using a hybrid combination of abductive, inductive and deductive approaches to advance the understanding of a phenomenon.
It seeks to incorporate a mode of induction into the research design, though explanation through induction is not the intended end-point (as with empiricist approaches). It forms a new mode of hypothesis generation before a deductive approach is employed.
The epistemological strategy adopted within data-driven science is to use guided knowledge discovery techniques to identify potential questions (hypotheses) worthy of further examination and testing.
R. Kitchin, Big data, new epistemologies and paradigm shifts, 2014
A philosophical interpretation
- The mechanization of induction
- The business of data
- The Big Data paradigm (data + computation)
- Critical analysis
- Tentative solutions (?)
- Open problems?
Hume's Legacy
Hume's anti-rationalist polemic helped introduce a gap between knowledge of the world and pure reasoning (Hume's fork).
Knowledge of the world = a product of repeated perceptions. Imagination becomes accustomed to foreseeing the order of events.
Note that this expectation subsumes a feeling of inevitability, somehow replacing the rejected rational necessity. It arises in the mind spontaneously and naturally, without the involvement of reason, merely because the mind is acted upon by the same objects in the same way repeatedly.
Induction is thus placed at the level of a non-rational feeling whose reliability is left to the vivacity and freshness of the perception of data.
So, removing any degree of rationality (or logos) within the contents of experience, we are led to reinforce the degree of connections
Open problems
- Induction: abstraction and generalization?
- Induction: models of data and models of phenomena?