1 Massive Data Analysis: Course Overview. Juliana Freire. Content obtained from many sources, including: Agrawal et al., VLDB 2010 tutorial; Shim, VLDB 2012 tutorial; Jeff Ullman's lecture notes; G. Weikum.
2 Course Staff and Information
Instructors:
o Juliana Freire
o Jerome Simeon
Reach us at … More info on …
In our wiki you will find:
o Tentative schedule
o News and announcements
o Reading list
o Assignments
Check it often!
3 What we will cover
Infrastructure: architecture, computing models (e.g., MapReduce), storage solutions (e.g., Bigtable, MongoDB), query/processing languages
Algorithms and analysis: statistics, data mining techniques
Tentative schedule in: …
Readings from:
o Scientific papers
o Textbooks (they are free to download!):
Mining of Massive Datasets (version 1.1), by Anand Rajaraman, Jure Leskovec, and Jeff Ullman
Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer
4 Prerequisites
A course in database systems, covering application programming in SQL and other database-related languages such as XQuery
A course on algorithms and data structures
Good programming skills
5 What you will do
Programming assignments (50%), done individually
o You will need a computer
o We will provide you access to Amazon AWS (more details later)
Quizzes (15%): you will use Gradiance
o Register at …
o Use token 00B06796
Final exam (35%)
7 Big Data: What is the Big deal?
8 Big Data: What is the Big deal?
Many success stories:
o Google: many billions of pages indexed, products, structured data. Google grew from processing 100 TB of data a day with MapReduce in 2004 to processing 20 PB a day with MapReduce in 2008.
o Facebook: 1.1 billion users using the site each month
o Twitter: 517 million accounts, 250 million tweets/day
"In April 2009, a blog post was written about eBay's two enormous data warehouses: one with 2 petabytes of user data, and the other with 6.5 petabytes of user data spanning 170 trillion records and growing by 150 billion new records per day. Shortly thereafter, Facebook revealed similarly impressive numbers, boasting of 2.5 petabytes of user data, growing at about 15 terabytes per day." (Lin and Dyer, 2010)
This is changing society!
9 The McKinsey Report on Big Data
"Data have swept into every industry and business function and are now an important factor of production, alongside labor and capital. We estimate that, by 2009, nearly all sectors in the US economy had at least an average of 200 terabytes of stored data (twice the size of US retailer Wal-Mart's data warehouse in 1999) per company with more than 1,000 employees."
"The use of big data will become a key basis of competition and growth for individual firms."
"There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions."
10 Big Data: New Opportunities
Enable scientific breakthroughs
o Petabytes of data generated each day, e.g., Australian radio telescopes, Large Hadron Collider, Sloan Digital Sky Survey, genomes, climate data, …
Social data, e.g., Facebook, Twitter
3,180,000 and 3,410,000 results in Google Scholar!
11 Big Data: New Opportunities
Smart Cities: 50% of the world population lives in cities
o Census, crime, emergency visits, taxis, public transportation, real estate, noise, energy, …
Cities are making their data available! https://nycopendata.socrata.com/
Make cities more efficient and sustainable, and improve the lives of their citizens
12 NYC Inspections
New York City gets 25,000 illegal-conversion complaints a year, but it has only 200 inspectors to handle them.
Flowers' group integrated information from 19 different agencies that provided indications of issues in buildings.
Result: the hit rate for inspections went from 13% to 70%!
13 Big Data: New Opportunities
NYU CUSP aims to use New York City as its laboratory and classroom to help cities around the world become more productive, livable, equitable, and resilient. CUSP observes, analyzes, and models cities to optimize outcomes, prototype new solutions, formalize new tools and processes, and develop new expertise/experts. http://cusp.nyu.edu/
14 Big Data: New Opportunities
Data is currency: companies are profiting from knowledge extracted from Big Data
o Better understand customers, targeted advertising, …
15 Big Data: New Opportunities
http://blogs.wsj.com/venturecapital/tag/big-data/
16 What is Massive/Big Data?
The three V's of big data: Volume, Variety, and Velocity
Too big (Volume): petabyte-scale collections or lots of (not necessarily big) data sets
Too hard (Variety): does not fit neatly in an existing tool
o Data sets that need to be cleaned, processed, and integrated
o E.g., Twitter, news, customer transactions
Too fast (Velocity): needs to be processed quickly
17 Big Data: What is the Big deal?
Big data is not new: financial transactions, call detail records, astronomy, …
What is new:
- Many more data enthusiasts
- More data are widely available, e.g., Web, data.gov, scientific data, social and urban data
- Computing is cheap and easy to access
o Server with 64 cores, 512 GB RAM: ~$11k
o Cluster with 1,000 cores: ~$150k
o Pay as you go: Amazon EC2
18 Big Data: More than Volume
Volume = Length × Width × Depth
Big Data Length: Collect & Compare
Big Data Width: Discover & Integrate
Big Data Depth: Analyze & Understand
Slide by Gerhard Weikum
19 Big Urban Data: NYC Taxis
20 Collect, Clean, and Compare
Beijing / NYC: compare different cities
21 Collect, Clean, and Compare
[Maps of taxi activity for 7-8am, 8-9am, 9-10am, and 10-11am]
Compare effects over time: a bigger picture of city life!
22 Discover and Integrate
Compare with other data sources, e.g., NYC Citi Bikes
Was there a traffic problem? An important event?
Discover information in news, blogs, etc.
24 Analyze and Understand The Sandy Effect
25 Analyze and Understand
Studying traffic patterns to and from the airports
26 Taxis in NYC: Rides per Hour
27 Big Data: What is hard?
Scalability for computations? NOT!
o Lots of work on distributed systems, parallel databases, …
o Elasticity: add more nodes!
But there is no one-size-fits-all solution: often, you have to build your own
Rapidly evolving technology, many different tools
Different computation model: need new algorithms
28 Big Data: What is hard?
Scalability for people: data exploration is hard, regardless of whether data are big or small
[Word cloud: algorithms, provenance, data integration, machine learning, visual encodings, statistics, interaction modes, data curation, math, data management; data -> knowledge]
29 (Big) Data Analysis Pipeline
30 Big Data: Challenges
31 Big Data: Challenges Apple: Fruit or company?
32 Big Data: Opportunities and Challenges
33 Big Data: Challenges
Taxi data: >0.5 billion trips
Can't load in Excel, and even commercial databases are too slow
Our solution to support interactive queries:
o New spatio-temporal index
o New index that leverages the GPU (work in progress)
34 Big Data: New Technologies
Infrastructure:
o New computing paradigms: Cloud, Hadoop MapReduce
o New storage solutions: NoSQL, column stores, Bigtable
o New languages: JAQL, Pig Latin
o We will study these and how they relate to previous technologies
Analysis and Mining:
o New infrastructure demands new approaches to explore data
o We will study algorithms to process and analyze data in Big-Data environments
36 What is Cloud Computing?
Old idea: Software as a Service (SaaS)
o Delivering applications over the Internet
Recently: [Hardware, Infrastructure, Platform] as a Service
Utility computing: pay-as-you-go computing
o Illusion of infinite resources
o No up-front cost
o Fine-grained billing (e.g., hourly)
Agrawal et al., VLDB 2010 Tutorial
37 Cloud Computing: Why Now?
Experience with very large data centers
o Unprecedented economies of scale
o Transfer of risk
Technology factors
o Pervasive broadband Internet
o Maturity in virtualization technology
Business factors
o Minimal capital expenditure
o Pay-as-you-go billing model
Agrawal et al., VLDB 2010 Tutorial
38 Warehouse-Scale Computing
Google's data center in Oregon
16 million nodes per building
Agrawal et al., VLDB 2010 Tutorial
39 Economics of Cloud Users
Pay by use instead of provisioning for peak
[Charts: resources vs. time, showing capacity and demand for a static data center (unused resources) vs. a data center in the cloud]
Agrawal et al., VLDB 2010 Tutorial; slide credits: Berkeley RAD Lab
40 Economics of Cloud Users
Risk of over-provisioning: underutilization
[Chart: resources vs. time for a static data center, with capacity above demand and the gap marked as unused resources]
Agrawal et al., VLDB 2010 Tutorial; slide credits: Berkeley RAD Lab
41 Economics of Cloud Users
Heavy penalty for under-provisioning
[Charts: resources vs. time (days), with demand exceeding capacity, marking lost revenue and lost users]
Agrawal et al., VLDB 2010 Tutorial; slide credits: Berkeley RAD Lab
42 Just hype?
"Cloud Computing? What are you talking about? Cloud Computing is nothing but a computer attached to a network."
-- Larry Ellison, excerpts from an interview
Agrawal et al., VLDB 2010 Tutorial
43 Cloud Computing: Hype or Reality?
Unlike the earlier attempts:
o Distributed Computing
o Distributed Databases
o Grid Computing
Cloud Computing is REAL:
o Organic growth: Google, Yahoo!, Microsoft, and Amazon
o Poised to be an integral aspect of national infrastructure in the US and elsewhere
Agrawal et al., VLDB 2010 Tutorial
44 Cloud Computing Modalities
"Can we outsource our IT software and hardware infrastructure?"
o Hosted applications and services
o Pay-as-you-go model
o Scalability, fault tolerance, elasticity, and self-manageability
"We have terabytes of click-stream data: what can we do with it?"
o Very large data repositories
o Complex analysis
o Distributed and parallel data processing
Agrawal et al., VLDB 2010 Tutorial
45 Why Data Analysis?
What is the most effective distribution channel?
Who are our lowest/highest margin customers?
Who are my customers and what products are they buying?
What product promotions have the biggest impact on revenue?
What impact will new products/services have on revenue and margins?
Which customers are most likely to go to the competition?
Agrawal et al., VLDB 2010 Tutorial
46 Why Data Analysis?
What would the impacts be of a fare change?
Where are our lowest/highest margin passengers?
What is the distribution of trip lengths?
What is the quickest route from midtown to downtown at 4pm on Monday?
What impact will the introduction of additional medallions have?
Where should drivers go to get passengers?
47 Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad hoc
Used by managers and end-users to understand the business and make judgments
Agrawal et al., VLDB 2010 Tutorial
48 Decision Support
Data analysis in the enterprise context emerged:
o As a tool to build decision support systems
o Data-centric decision making instead of using intuition
o New term: Business Intelligence
Traditional approach:
o Decision makers wait for reports from disparate OLTP systems
o Put it all together in a spreadsheet
o Manual process
Agrawal et al., VLDB 2010 Tutorial
49 Data Analytics in the Web Context
Data capture at the user-interaction level:
o In contrast to the client-transaction level in the enterprise context
As a consequence, the amount of data increases significantly
Need to analyze such data to understand user behaviors
Agrawal et al., VLDB 2010 Tutorial
50 Data Analytics outside Big Corporations
Even data capture at the client-transaction level leads to a lot of data!
Need to analyze such data to understand behavior
Cannot afford expensive warehouse solutions
51 Data Analytics in the Cloud
Scalability to large data volumes (see the back-of-the-envelope check below):
o Scan 100 TB on 1 node at 50 MB/sec = 23 days
o Scan 100 TB on a 1,000-node cluster = 33 minutes
o => Divide-and-conquer (i.e., data partitioning)
Cost-efficiency:
o Commodity nodes (cheap, but unreliable)
o Commodity network
o Automatic fault tolerance (fewer admins)
o Easy to use (fewer programmers)
Agrawal et al., VLDB 2010 Tutorial
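To see where these numbers come from, here is a quick back-of-the-envelope check in Python, assuming a sequential scan at 50 MB/sec per node and perfectly even partitioning:

```python
# Back-of-the-envelope scan times, assuming 50 MB/sec sequential
# read per node and perfectly even data partitioning.
TB = 10**12          # bytes in a terabyte (decimal)
MB = 10**6           # bytes in a megabyte

data = 100 * TB      # 100 TB to scan
rate = 50 * MB       # 50 MB/sec per node

one_node = data / rate        # seconds on a single node
cluster = one_node / 1000     # seconds spread over 1,000 nodes

print(one_node / 86400)   # ~23.1 days
print(cluster / 60)       # ~33.3 minutes
```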
52 Platforms for Large-scale Data Analysis
Parallel DBMS technologies
o Proposed in the late eighties
o Matured over the last two decades
o Multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises
MapReduce
o Pioneered by Google
o Popularized by Yahoo! (Hadoop)
Agrawal et al., VLDB 2010 Tutorial
53 Parallel DBMS technologies
Popularly used for more than two decades
o Research projects: Gamma, Grace, …
o Commercial: multi-billion dollar industry, but access to only a privileged few
Relational data model
Indexing
Familiar SQL interface
Advanced query optimization
Well understood and studied
Very reliable!
Agrawal et al., VLDB 2010 Tutorial
54 MapReduce
Overview:
o Data-parallel programming model
o An associated parallel and distributed implementation for commodity clusters
Pioneered by Google
o Processes 20 PB of data per day (circa 2008)
Popularized by the open-source Hadoop project
o Used by Yahoo!, Facebook, Amazon, and the list is growing
[Dean et al., OSDI 2004; CACM Jan 2008; CACM Jan 2010]
Agrawal et al., VLDB 2010 Tutorial
55 Hadoop
Open-source MapReduce framework; an Apache project
Hadoop Distributed File System (HDFS)
o Stores big files across machines
o Stores each file as a sequence of blocks
o Blocks of a file are replicated for fault tolerance
Distributes processing of large data across thousands of commodity machines
Key components:
o MapReduce: distributes applications
o Hadoop Distributed File System (HDFS): distributes data
A single Namenode (master) and multiple Datanodes (slaves)
o Namenode: manages the file system and access to files by clients
o Datanode: manages the storage attached to the node it runs on
Kyuseok Shim (VLDB 2012 Tutorial)
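As a rough illustration of the storage model described above, the following Python sketch splits a file into fixed-size blocks and records replicas on several datanodes, as a namenode-style table would. The block size, replication factor, and round-robin placement are simplifying assumptions for illustration, not Hadoop's actual placement policy:

```python
import itertools

# Toy model of HDFS-style storage: a file becomes a sequence of
# fixed-size blocks, each replicated on several datanodes.
BLOCK_SIZE = 64 * 2**20   # 64 MB, a common default in early Hadoop
REPLICATION = 3           # assumed replication factor
DATANODES = [f"datanode{i}" for i in range(10)]

def place_blocks(file_size):
    """Return a namenode-style table: block id -> list of datanodes."""
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    nodes = itertools.cycle(DATANODES)
    return {b: [next(nodes) for _ in range(REPLICATION)]
            for b in range(n_blocks)}

# A 200 MB file needs 4 blocks, each stored on 3 datanodes.
for block, replicas in place_blocks(200 * 2**20).items():
    print(block, replicas)
```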
56 MapReduce Programming Model
Borrows from functional programming
Users should implement two primary methods:
o Map: (key1, val1) -> [(key2, val2)]
o Reduce: (key2, [val2]) -> [(key3, val3)]
Kyuseok Shim (VLDB 2012 Tutorial)
57 Word Counting with MapReduce
Input documents:
o Doc1: Financial, IMF, Economics, Crisis
o Doc2: Financial, IMF, Crisis
o Doc3: Economics, Harry
o Doc4: Financial, Harry, Potter, Film
o Doc5: Crisis, Harry, Potter
Mapper M1 processes Doc1 and Doc2; mapper M2 processes Doc3-Doc5. Each emits one (word, 1) pair per occurrence, e.g., (Financial, 1), (IMF, 1), (Economics, 1), (Crisis, 1), …
Kyuseok Shim (VLDB 2012 Tutorial)
58 Word Counting with MapReduce
Before the reduce functions are called, the list of values for each distinct key is generated (the shuffle phase), e.g., (Crisis, [1, 1, 1]), (Harry, [1, 1, 1]), (Film, [1]), (Potter, [1, 1]), …
Each reducer then sums the list for its keys, producing (Financial, 3), (IMF, 2), (Economics, 2), (Crisis, 3), (Harry, 3), (Film, 1), (Potter, 2).
Kyuseok Shim (VLDB 2012 Tutorial)
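The same example can be simulated in a few lines of Python. This is a single-process sketch of the map/shuffle/reduce dataflow, not Hadoop code:

```python
from collections import defaultdict

# Single-process sketch of the MapReduce word-count dataflow:
# map emits (word, 1), shuffle groups values by key, reduce sums.
docs = {
    "Doc1": ["Financial", "IMF", "Economics", "Crisis"],
    "Doc2": ["Financial", "IMF", "Crisis"],
    "Doc3": ["Economics", "Harry"],
    "Doc4": ["Financial", "Harry", "Potter", "Film"],
    "Doc5": ["Crisis", "Harry", "Potter"],
}

def map_fn(doc_id, words):      # Map: (key1, val1) -> [(key2, val2)]
    return [(w, 1) for w in words]

def reduce_fn(word, counts):    # Reduce: (key2, [val2]) -> (key3, val3)
    return (word, sum(counts))

# Map phase
pairs = [kv for d, ws in docs.items() for kv in map_fn(d, ws)]

# Shuffle phase: group values by key
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase
print(sorted(reduce_fn(w, c) for w, c in groups.items()))
# [('Crisis', 3), ('Economics', 2), ('Film', 1), ('Financial', 3),
#  ('Harry', 3), ('IMF', 2), ('Potter', 2)]
```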
59 MapReduce Advantages
Automatic parallelization:
o Depending on the size of the raw input data, instantiate multiple map tasks
o Similarly, depending on the number of intermediate <key, value> partitions, instantiate multiple reduce tasks
Run-time system handles:
o Data partitioning
o Task scheduling
o Handling machine failures
o Managing inter-machine communication
Completely transparent to the programmer/analyst/user
Agrawal et al., VLDB 2010 Tutorial
60 MapReduce Experience
Runs on large commodity clusters:
o 1,000s to 10,000s of machines
Processes many terabytes of data
Easy to use, since run-time complexity is hidden from the users
1,000s of MR jobs/day at Google (circa 2004)
100s of MR programs implemented (circa 2004)
Agrawal et al., VLDB 2010 Tutorial
61 The Need
Special-purpose programs to process large amounts of data: crawled documents, Web query logs, etc.
At Google and others (Yahoo!, Facebook):
o Inverted index
o Graph structure of the Web documents
o Summaries of #pages/host, set of frequent queries, etc.
o Ad optimization
o Spam filtering
Agrawal et al., VLDB 2010 Tutorial
62 Takeaway
MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance
Principal philosophies:
o Make it scale, so you can throw hardware at problems
o Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)
Hive and Pig further simplify programming
MapReduce is not suitable for all problems, but when it works, it may save you a lot of time
Agrawal et al., VLDB 2010 Tutorial
63 MapReduce vs. Parallel DBMS
o Schema support: Parallel DBMS: yes; MapReduce: not out of the box
o Indexing: Parallel DBMS: yes; MapReduce: not out of the box
o Programming model: Parallel DBMS: declarative (SQL); MapReduce: imperative (C/C++, Java, …), with extensions through Pig and Hive
o Optimizations (compression, query optimization): Parallel DBMS: yes; MapReduce: not out of the box
o Flexibility: Parallel DBMS: not out of the box; MapReduce: yes
o Fault tolerance: Parallel DBMS: coarse-grained techniques; MapReduce: yes
[Pavlo et al., SIGMOD 2009; Stonebraker et al., CACM 2010; …]
Agrawal et al., VLDB 2010 Tutorial
64 MapReduce: A step backwards?
Don't need 1,000 nodes to process petabytes:
o Parallel DBs do it in fewer than 100 nodes
No support for schema:
o Sharing across multiple MR programs is difficult
No indexing:
o Wasteful access to unnecessary data
Non-declarative programming model:
o Requires highly skilled programmers
No support for JOINs:
o Requires multiple MR phases for the analysis (see the sketch below)
We will study this in more detail
Agrawal et al., VLDB 2010 Tutorial
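To make the JOIN criticism concrete, here is a minimal single-process sketch of a reduce-side (repartition) join, one standard way to express a join as a MapReduce job; the tables and field names are made up for illustration:

```python
from collections import defaultdict

# Reduce-side (repartition) join sketch: map tags each record with
# its source table, shuffle groups by the join key, reduce pairs
# the two sides up. Tables and fields are hypothetical.
users = [(1, "Alice"), (2, "Bob")]               # (user_id, name)
orders = [(1, "book"), (1, "pen"), (2, "lamp")]  # (user_id, item)

def map_users(rec):
    uid, name = rec
    return (uid, ("U", name))    # tag with source "U"

def map_orders(rec):
    uid, item = rec
    return (uid, ("O", item))    # tag with source "O"

# Map + shuffle: group tagged values by join key
groups = defaultdict(list)
for k, v in [map_users(r) for r in users] + [map_orders(r) for r in orders]:
    groups[k].append(v)

# Reduce: cross the two sides within each key group
for uid, vals in groups.items():
    names = [v for tag, v in vals if tag == "U"]
    items = [v for tag, v in vals if tag == "O"]
    for name in names:
        for item in items:
            print(uid, name, item)
```

In SQL this is a one-line equi-join; in plain MapReduce it takes a full job, and anything more complex (multi-way joins) needs multiple MR phases.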
65 Analysis and Mining
67 Data Analysis and Mining
Many challenges, even when data is not big!
Data cleaning and curation:
o Detection and correction of errors in data, e.g., age = 150 (a sketch follows below)
o Entity resolution and disambiguation, e.g., apple the fruit vs. Apple the company
Visualization: pictures help us to think
o Substitute perception for cognition
o External memory: free up limited cognitive/memory resources for higher-level problems
Mining: discovery of useful, possibly unexpected, patterns in data
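As a toy illustration of the error-detection bullet, a rule-based range check might look like this; the field names and valid range are assumptions for the example:

```python
# Toy rule-based error detection for the "age = 150" example.
# Field names and the plausible age range are assumptions.
records = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": 150},   # implausible: flagged as an error
    {"name": "Eve", "age": -3},    # impossible: flagged as an error
]

VALID_AGE = range(0, 121)          # assumed plausible human age range

errors = [r for r in records if r["age"] not in VALID_AGE]
print(errors)  # [{'name': 'Bob', 'age': 150}, {'name': 'Eve', 'age': -3}]
```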
68 (Big) Data Analysis Pipeline
70 Data Analysis and Mining
In exploratory tasks, change is the norm!
o Data analysis and mining are iterative processes
o Many trial-and-error steps; easy to get lost
Need to manage the data exploration process:
o Guide users
o Need provenance for reproducibility [Freire et al., CISE 2008]
[Figure: exploration loop: data -> process -> data product -> perception & cognition -> knowledge, with the user driving specification, exploration, and data manipulation. Modified from J. van Wijk, IEEE Vis 2005]
71 Analyzing and Mining Big Data: Issues
Besides scalability for algorithms and computations:
A big data-mining risk is that you will discover patterns that are meaningless.
Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find junk.
Jeff Ullman's lecture notes
72 Examples of Bonferroni's Principle
1. A big objection to Total Information Awareness (TIA) was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
2. The Rhine Paradox: a great example of how not to conduct scientific research.
Jeff Ullman's lecture notes
73 "Stanford Professor Proves Tracking Terrorists Is Impossible!"
A reporter from the LA Times picked up an example from Professor Ullman's class.
Despite attempts by Professor Ullman, the reporter was unable to grasp that the story was made up to illustrate Bonferroni's Principle and was not real.
Modified from Jeff Ullman's lecture notes
74 The TIA Example
Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day.
Jeff Ullman's lecture notes
75 TIA Example: Details
10^9 people being tracked.
1,000 days.
Each person stays in a hotel 1% of the time (10 days out of 1,000).
Hotels hold 100 people (so 10^5 hotels are needed to hold 1% of the people being tracked).
If everyone behaves randomly (i.e., there are no evil-doers), will the data mining detect anything suspicious?
Jeff Ullman's lecture notes
76 TIA Example: Calculations (1)
Probability that given persons p and q will be at the same hotel on a given day d:
o 1/100 × 1/100 × 1/10^5 = 10^-9
(p is at some hotel with probability 1/100, q likewise, and they pick the same one of the 10^5 hotels with probability 1/10^5)
Probability that p and q will be at the same hotel on given days d1 and d2:
o 10^-9 × 10^-9 = 10^-18
Pairs of days: C(1000, 2) = 1000! / (998! × 2!) ≈ 5 × 10^5
Jeff Ullman's lecture notes
77 TIA Example: Calculations (2)
Probability that p and q will be at the same hotel on some two days:
o 5 × 10^5 × 10^-18 = 5 × 10^-13
Pairs of people: C(10^9, 2) ≈ 5 × 10^17
Expected number of suspicious pairs of people:
o 5 × 10^17 × 5 × 10^-13 = 250,000
Jeff Ullman's lecture notes
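The whole calculation fits in a few lines of Python, which makes it easy to play with the parameters:

```python
from math import comb

# Expected number of "suspicious" pairs under purely random behavior,
# following the slide's parameters.
people = 10**9
days = 1000
p_hotel = 0.01     # a given person is at some hotel on a given day
hotels = 10**5

p_same_day = p_hotel * p_hotel / hotels   # 1e-9: same hotel, given day
p_two_days = p_same_day ** 2              # 1e-18: same hotel, two given days
day_pairs = comb(days, 2)                 # ~5e5
people_pairs = comb(people, 2)            # ~5e17

expected = people_pairs * day_pairs * p_two_days
print(expected)   # ~249,750: about 250,000 suspicious pairs from noise alone
```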
78 Conclusion
Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
Analysts would have to sift through 250,000 candidate pairs to find the 10 real cases.
o Not gonna happen.
o But how can we improve the scheme?
Jeff Ullman's lecture notes
79 Moral
When looking for a property (e.g., "two people stayed at the same hotel twice"), make sure that the property does not allow so many possibilities that random data will surely produce facts of interest.
Jeff Ullman's lecture notes
80 Rhine Paradox (1)
Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception (ESP).
He devised (something like) an experiment where subjects were asked to guess the color, red or blue, of 10 hidden cards.
He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!
Jeff Ullman's lecture notes
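The "1 in 1000" is exactly what chance predicts: guessing 10 binary outcomes correctly has probability (1/2)^10 = 1/1024. A quick check of how many "ESP" subjects pure random guessing produces:

```python
import random

# Probability of guessing all 10 red/blue cards correctly by chance.
p_all_right = (1 / 2) ** 10
print(p_all_right)   # 0.0009765625, i.e., about 1 in 1000

# Simulate: out of 100,000 random guessers, how many "have ESP"?
random.seed(0)
esp = sum(
    all(random.random() < 0.5 for _ in range(10))  # 10 correct guesses
    for _ in range(100_000)
)
print(esp)   # close to 100000/1024 ≈ 98 lucky guessers
```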
81 Rhine Paradox (2)
He told these people they had ESP and called them in for another test of the same type.
Alas, he discovered that almost all of them had lost their ESP.
What did he conclude?
o "You shouldn't tell people they have ESP; it causes them to lose it."
Jeff Ullman's lecture notes
82 Moral
Understanding Bonferroni's Principle will help you look a little less stupid than a parapsychologist.
Jeff Ullman's lecture notes
83 Next Class
Introduction to MapReduce and high-level data processing languages