1 Big Data and Science: Myths and Reality H.V. Jagadish
2 Six Myths about Big Data: It's all hype. It's all about size. It's all analysis magic. Reuse is easy. It's the same as Data Science. It's all in the cloud.
3 Big Data Myth 1: Big Data is all hype.
4 Data Analysis Has Been Around for a While: R.A. Fisher; Peter Luhn; W.E. Deming; 1970: Relational Database, E.F. Codd; Howard Dresner. Abridged version of Jeff Hammerbacher's timeline for CS 194 at UCB, 2012.
5 Breathless Journalists!!
6 Big Data Impetus: Can collect cheaply, due to digitization. Can store cheaply, due to falling media prices. Driven by business process automation and the web. But now having an impact everywhere.
7 Nearly every field of endeavor is transitioning from data poor to data rich. Astronomy: LSST. Physics: LHC. Oceanography: OOI. Neuroscience: EEG, fMRI. Sociology: the Web. Biology: sequencing. Economics: mobile, POS terminals. Data-driven medicine. Sports. Slide courtesy of the Moore Sloan Data Science Environments Initiative.
8 Data-Driven Science: 1. Empirical + experimental 2. Theoretical 3. Computational 4. Data-Intensive. Jim Gray. Slide courtesy of the Moore Sloan Data Science Environments Initiative.
9 Big Data Hype? The Gartner Hype Cycle. Just because it's hyped doesn't mean we can or should ignore it. Slide courtesy of Michael Franklin.
10 Big Data Fact 1: The myth: Big Data is all hype. The fact: It may be hyped, but there is more than enough substance there for it to deserve our attention.
11 Big Data Myth 2: Size is all that matters. Challenges are only at the extremes (in size).
12 What is Big Data? Gartner Definition: Volume, Velocity, Variety, Veracity, V..
13 Variety: How do you even measure variety? No measure => hard to track progress. Infinite variety on the web: you keep finding sites you have never seen before. Infinite variety in human-generated content.
14 Veracity: Who do you trust? Reputation on the web. Independence determination: when is it a new source and when is it a copy?
15 Big Data Fact 2: The myth: Size is all that matters. The fact: Yes, Volume and Velocity are challenging, but Variety and Veracity are far more challenging.
16 Big Data Myth 3: Big Data -> Analysis Magic -> Deep Insights
17 Companies Propagate This!! From the web site of a representative Silicon Valley company.
18 The Big Data Pipeline
19 Big Data Challenges In each of the steps Read the whitepaper: Shorter version in CACM, July 2014.
20 Big Data Fact 3: Every aspect of the data ecosystem poses challenges that must be addressed.
21 Big Data Myth 4: Data reuse is low-hanging fruit. Lots of data collected for some purpose can (later) be used for a different purpose.
22 Unemployment Rate Prediction based on Tweets Cafarella, Levenstein, Shapiro
23 Data is Organized Wrong: E.g., administrative data is often rolled up by administrative jurisdiction. Consider Butler County, Ohio.
26 Data is Organized Wrong: E.g., administrative data is often rolled up by administrative jurisdiction. How to compare data rolled up by school district with data rolled up by zip code? Working with the Gates Foundation: create *estimated* data rolled up by the desired jurisdiction.
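One common way to produce such estimates is a weighted crosswalk: each source unit (say, a zip code) is split across the target units (say, school districts) in proportion to an assumed share, such as the fraction of its population living in each district. The sketch below illustrates that idea only; it is not the method used in the project mentioned on the slide, and the unit names, counts, and shares are all hypothetical.

```python
from collections import defaultdict


def reaggregate(source_totals, crosswalk):
    """Estimate totals per target unit from totals per source unit.

    source_totals: {source_unit: total}
    crosswalk: {source_unit: {target_unit: share}}, where the shares for
               each source unit are assumed to sum to 1.
    """
    estimates = defaultdict(float)
    for source, total in source_totals.items():
        shares = crosswalk[source]
        # Refuse a crosswalk that would silently lose or invent counts.
        if abs(sum(shares.values()) - 1.0) > 1e-9:
            raise ValueError(f"shares for {source!r} do not sum to 1")
        for target, share in shares.items():
            estimates[target] += total * share
    return dict(estimates)


# Hypothetical counts rolled up by zip code.
zip_counts = {"ZIP_A": 1000.0, "ZIP_B": 600.0}

# Hypothetical shares of each zip code falling in each school district.
crosswalk = {
    "ZIP_A": {"District 1": 0.75, "District 2": 0.25},
    "ZIP_B": {"District 2": 1.0},
}

district_estimates = reaggregate(zip_counts, crosswalk)
print(district_estimates)
```

The output is only as good as the shares: the approach assumes the quantity being counted is spread across a zip code in the same way as whatever was used to build the weights, which is why the slide calls the result *estimated* data.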
27 Research Data Reuse: Much data is now available. Strong push from federal agencies. Parallel push from reproducibility advocates. But obstacles remain: incentives to record metadata (very hard for a third party to use otherwise); data citation methodology and convention.
28 Big Data Fact 4: The myth: Data reuse is low-hanging fruit. The fact: Data reuse is critical to address. It holds out great promise, but also poses many challenging questions.
29 Big Data Myth 5: Data Science is the same as Big Data.
30 Data Science: The use of data to address problems in a domain of interest. Requires data management, data analysis, and domain knowledge. Often involves Big Data, but may not.
31 [Diagram labels: Computer & Information Sciences; Statistical & Mathematical Sciences; Domain Sciences; Data Science]
32 Data Science Status: Importance widely recognized in academia, partly driven by employer demand. Multi-disciplinary nature recognized. Common solution is to have some sort of structure that overlays and crosses traditional departments. E.g.
33 Big Data Fact 5: The myth: Data Science is the same as Big Data. The fact: Data Science is related to, but different from, Big Data.
34 Big Data Myth 6: The central challenge with Big Data is that of devising new computing architectures and algorithms.
35 Big Data Myth 6 (reprise): Big Data is all in the cloud. Big Data = MapReduce-style computation.
36 What is Big Data? Volume, Velocity, Variety, Veracity. More than you know how to handle.
37 Humans and Big Data: We can buy bigger systems, more machines, faster CPUs, larger disks. But human ability does not scale! Big Data poses huge challenges for human interaction.
38 Usability for Data Science: Data Science tasks usually involve data analysis by a domain expert with limited database expertise. If the domain expert is to succeed, the data must be usable. Usability matters most when the data are big.
39 Database Usability: Improve the user's ability to complete a task with a (big) database through better query formulation and result presentation. HCI principles are very useful. But usability is not interface design. See
40 Big Data Fact 6: The myth: Big Data is all about the cloud. The fact: The cloud has its place in the constellation of relevant technologies, but is not a required piece of every solution. In fact, there are many other challenges that are at least as important; cf. the National Academies report on Frontiers of Massive Data Analysis.
41 Acknowledgments NSF Grants and
42 Big Data and Data Science: Lots of buzz, with good reason. Great potential. Many challenges.