What's Big Data? Big Data: 3 V's. Variety (Complexity). 5/5/2016. Introduction to Big Data, mostly from slides by Ruoming Jin

What's Big Data?
No single definition; here is one from Wikipedia:
- Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
- The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
- The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."

Big Data: 3 V's
- 12+ TBs of tweet data every day
- 25+ TBs of log data every day
- ? TBs of data every day
- 30 billion RFID tags today (1.3B in 2005)
- 4.6 billion camera phones worldwide
- 100s of millions of GPS-enabled devices sold annually
- 76 million smart meters in [...], [...] M by [...]
- [...] billion people on the Web by end [...]

Variety (Complexity)
- Relational Data (Tables/Transaction/Legacy Data)
- Text Data (Web)
- Semi-structured Data (XML)
- Graph Data: Social Network, Semantic Web (RDF), ...
- Streaming Data: you can only scan the data once
- A single application can be generating/collecting many types of data
- Big Public Data (online, weather, finance, etc.)
- CERN's Large Hadron Collider (LHC) generates 15 PB a year (photo: Maximilien Brice, CERN)
- To extract knowledge, all these types of data need to be linked together

Velocity (Speed)
- Data is being generated fast and needs to be processed fast
- Online Data Analytics
- Late decisions mean missed opportunities
- Examples:
  - E-Promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
  - Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require immediate reaction

Real-time/Fast Data
- Social media and networks (all of us are generating data)
- Scientific instruments (collecting all sorts of data)
- Mobile devices (tracking all objects all the time)
- Sensor technology and networks (measuring all kinds of data)
- Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

Real-Time Analytics/Decision Requirement
Customer: Influence Behavior
- Product recommendations that are relevant and compelling
- Learning why customers switch to competitors and their offers, in time to counter
- Improving the marketing effectiveness of a promotion while it is still in play
- Preventing fraud as it is occurring, and preventing more proactively
- Friend invitations to join a game or activity that expands business

Harnessing Big Data
- OLTP: Online Transaction Processing (DBMSs)
- OLAP: Online Analytical Processing (Data Warehousing)
- RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

The Model Has Changed
The model of generating/consuming data has changed:
- Old model: few companies are generating data, all others are consuming data
- New model: all of us are generating data, and all of us are consuming data

THE EVOLUTION OF BUSINESS INTELLIGENCE
[Figure: timeline across the 1990's, 2000's, and 2010's, with axes Speed and Scale]
- BI Reporting: OLAP & data warehouse (Business Objects, SAS, Informatica, Cognos, other SQL reporting tools)
- Interactive Business Intelligence & in-memory RDBMS (QlikView, Tableau, HANA)
- Big Data: Real Time & Single View (graph databases)
- Big Data: Batch Processing & Distributed Data Store (Hadoop/Spark; HBase/Cassandra)

Big Data Analytics
- Big data is more real-time in nature than traditional DW applications
- Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps
- Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps

Big Data Technology

Cloud Computing
- IT resources provided as a service: compute, storage, databases, queues
- Clouds leverage economies of scale of commodity hardware: cheap storage, high-bandwidth networks & multicore processors
- Geographically distributed data centers
- Offerings from Microsoft, Amazon, Google, ...

Topic 2: Hadoop/MapReduce Programming & Data Processing
- Architecture of Hadoop, HDFS, and Yarn
- Programming on Hadoop
- Basic Data Processing: Sort and Join
- Information Retrieval using Hadoop
- Data Mining using Hadoop (Kmeans + Histograms)
- Machine Learning on Hadoop (EM)
- Hive/Pig
- Spark vs. MapReduce on HDFS
- HBase and Cassandra

References
- Hadoop: The Definitive Guide, Tom White, O'Reilly
- Hadoop In Action, Chuck Lam, Manning
- Doing Data Science, Rachel Schutt and Cathy O'Neil, O'Reilly
- Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer
- Good tutorial presentation & examples at: [...]
- The definitive original paper: [...]

Cloud Resources
- Hadoop on your local machine
- Hadoop in a virtual machine on your local machine (Pseudo-Distributed on Ubuntu)
- Hadoop in the clouds with Amazon EC2

Introduction to MapReduce/Hadoop
From Ruoming Jin's slides, themselves adapted from Jimmy Lin's slides (at UMD)

Limitations of Existing Data Analytics Architecture
[Figure: data pipeline, top to bottom: BI Reports + Interactive Apps; RDBMS (aggregated data); ETL Compute Grid; Storage Only Grid (original raw data), mostly append; Collection; Instrumentation]
- Can't explore original high-fidelity raw data
- Moving data to compute doesn't scale
- Archiving = premature data death
Slides from Dr. Amr Awadallah's Hadoop talk at Stanford, CTO & VPE from Cloudera. 2011 Cloudera, Inc. All Rights Reserved.

Key Ideas
- Scale out, not up: limits of SMP and large shared-memory machines
- Move processing to the data: the cluster may have limited bandwidth
- Process data sequentially, avoid random access: seeks are expensive, disk throughput is reasonable
- Seamless scalability: from the mythical man-month to the tradable machine-hour

The datacenter is the computer!

Apache Hadoop
- Scalable fault-tolerant distributed system for Big Data: data storage and data processing; a virtual Big Data machine
- Borrowed concepts/ideas from Google; open source under the Apache license
- Core Hadoop has two main systems:
  - Hadoop/MapReduce: distributed big data processing infrastructure (abstract/paradigm, fault-tolerant, schedule, execution)
  - HDFS (Hadoop Distributed File System): fault-tolerant, high-bandwidth, high-availability distributed storage
- More recently (since 2014): Apache Spark on Hadoop/HDFS and...
- Apache Spark is now the most active open source project in big data, with more than 600 contributors within the past year.

MapReduce: Big Data Processing Abstraction
Example: word counts
- Millions of documents in
- Word counts out: brown, 2; fox, 2; the, 3; ...

In practice, before MapReduce and related technologies:
- The first 10 computers are easy;
- The first 100 computers are hard;
- The first 1000 computers are impossible;
- But now with MapReduce, engineers at Google often use [...] computers!

What's wrong with 1000 computers?
- Some will crash while you're working
- If the probability of a crash on one machine = 0.001, then the probability that all 1000 are up = (1 - 0.001)^1000 ≈ 0.37
- MapReduce expects crashes, tracks partial work, keeps going

Typical Large-Data Problem
- Iterate over a large number of records
- Extract something of interest from each (Map)
- Shuffle and sort intermediate results
- Aggregate intermediate results (Reduce)
- Generate final output
Key idea: provide a functional abstraction for these two operations (Dean and Ghemawat, OSDI 2004)

MapReduce
- Programmers specify two functions:
    map (k, v) -> [(k', v')]
    reduce (k', [v']) -> [(k', v')], or simpler [...]
- All values with the same key (k') are sent to the same reducer, in k' order for each reducer
- Here [] means a sequence
- The execution framework handles everything else
- Spark has map and reduce as operations, plus others

"Hello World": Word Count

    Map(String docid, String text):
        for each word w in text:
            Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
        int sum = 0;
        for each v in values:
            sum += v;
        Emit(term, sum);

This can be done in MapReduce or Spark.
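The word-count pseudocode above can be tried out on one machine. The following is a minimal sketch in plain Python, not Hadoop or Spark code: the helper run_mapreduce is a hypothetical stand-in for the execution framework (it only groups values by key and sorts the keys), and the three input documents are the ones used in the Word Count Execution figure.

```python
from collections import defaultdict


def map_fn(docid, text):
    """Map: emit (word, 1) for every word in the document."""
    for word in text.split():
        yield word, 1


def reduce_fn(term, values):
    """Reduce: sum the counts collected for one word."""
    yield term, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Toy single-process stand-in for the execution framework (an assumption,
    not a real API): run the mapper, group values by key, reduce in key order."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)  # shuffle: all values for a key end up together
    output = []
    for k2 in sorted(groups):  # sort: each reducer sees its keys in order
        output.extend(reducer(k2, groups[k2]))
    return output


docs = [
    ("d1", "the quick brown fox"),
    ("d2", "the fox ate the mouse"),
    ("d3", "how now brown cow"),
]
counts = dict(run_mapreduce(docs, map_fn, reduce_fn))
print(counts["the"], counts["fox"], counts["brown"])  # prints: 3 2 2
```

Only map_fn and reduce_fn correspond to what a programmer writes; everything inside run_mapreduce is what the real framework does across many machines.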

MapReduce Runtime
- Handles scheduling: assigns workers to map and reduce tasks
- Handles data distribution: moves processes to data
- Handles synchronization: gathers, sorts, and shuffles intermediate data
- Handles errors and faults: detects worker failures and restarts
- Everything happens on top of a distributed FS (later)
- This description is also valid for Spark, but Spark uses memory more, so it can run faster in many cases

MapReduce
- Programmers specify two functions:
    map (k, v) -> [(k', v')]
    reduce (k', [v']) -> [(k', v')]
- All values with the same key are reduced together
- The execution framework handles everything else
- Not quite: usually, programmers also specify:
    partition (k', number of partitions) -> partition for k'
      Often a simple hash of the key, e.g., hash(k') mod n
      Divides up the key space for parallel reduce operations and eventual delivery of results to certain partitions
    combine (k', [v']) -> [(k', v')]
      Mini-reducers that run in memory after the map phase
      Used as an optimization to reduce network traffic

[Figure: MapReduce data flow. Input pairs (k1, v1) ... (k6, v6) go to four map tasks, which emit (b,1) (a,2); (c,3) (c,6); (a,5) (c,2); (b,7) (c,8). Each map output passes through combine (the second becomes (c,9)) and partition. Shuffle and Sort aggregates values by key: a -> [1, 5], b -> [2, 7], c -> [...]. Three reduce tasks emit (r1, s1), (r2, s2), (r3, s3).]

Word Count Execution
[Figure: Input -> Map -> Shuffle & Sort -> Reduce -> Output]
- Input: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
- After Shuffle & Sort: brown: 1,1; fox: 1,1; how: 1; now: 1; the: 1,1,1; ate: 1; cow: 1; mouse: 1; quick: 1
- Output: brown, 2; fox, 2; the, 3; ...

MapReduce Implementations
- Google has a proprietary implementation in C++, with bindings in Java and Python
- Hadoop is an open-source implementation in Java
  - Development led by Yahoo, used in production
  - Now an Apache project
  - Rapidly expanding software ecosystem
- Lots of custom research implementations, for GPUs, cell processors, etc.

Hadoop History
- Dec 2004: Google MapReduce paper published (OSDI 2004; the GFS paper appeared in 2003)
- July 2005: Nutch uses MapReduce
- Apr 2007: Yahoo! on a 1000-node cluster
- Jan 2008: an Apache Top Level Project
- Jul 2008: a 4000-node test cluster
- Sept 2008: Hive becomes a Hadoop subproject
- Feb 2009: the Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data that is now used in every Yahoo! Web search query
- June 2009: on June 10, 2009, Yahoo! made available the source code to the version of Hadoop it runs in production
- In 2010 Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage. On July 27, 2011 they announced the data had grown to 30 PB.
- Feb 2014: Apache Spark starts as a Top-Level Project. Originally developed (2011+) at the University of California, Berkeley's AMPLab, the Spark codebase was at this point donated to Apache (open source).
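The partition function described earlier on this page ("a simple hash of the key, e.g., hash(k') mod n") can be sketched in a few lines of Python. This is an illustration only, not Hadoop's partitioner: zlib.crc32 is chosen here as an assumption because it gives the same value on every run, whereas Python's built-in hash() of a string changes between processes. The sample pairs are the map outputs from the data-flow figure.

```python
import zlib


def partition(key, num_partitions):
    """Pick the reducer for a key: a stable hash of the key, mod n."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split_by_partition(pairs, num_partitions):
    """Route each intermediate (key, value) pair to its partition's bucket."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition(key, num_partitions)].append((key, value))
    return buckets


# Intermediate pairs as emitted by the four map tasks in the figure.
pairs = [("b", 1), ("a", 2), ("c", 3), ("c", 6),
         ("a", 5), ("c", 2), ("b", 7), ("c", 8)]
buckets = split_by_partition(pairs, 3)

for index, bucket in enumerate(buckets):
    print(index, bucket)
```

Whatever the hash values turn out to be, every pair with the same key lands in the same bucket, which is what lets each reducer work on its share of the key space independently.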

Who uses Hadoop?
Amazon/A9, Facebook, Google, IBM, Joost, Last.fm, New York Times, PowerSet, Veoh, Yahoo!

Example Word Count (Map)

    // The three snippets below are nested inside one class, WordCount, which needs:
    // import java.io.IOException;
    // import java.util.StringTokenizer;
    // import org.apache.hadoop.conf.Configuration;
    // import org.apache.hadoop.fs.Path;
    // import org.apache.hadoop.io.IntWritable;
    // import org.apache.hadoop.io.Text;
    // import org.apache.hadoop.mapreduce.Job;
    // import org.apache.hadoop.mapreduce.Mapper;
    // import org.apache.hadoop.mapreduce.Reducer;
    // import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    // import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    // import org.apache.hadoop.util.GenericOptionsParser;

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit (word, 1) for every token in the input line.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

Example Word Count (Reduce)

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        // Sum all counts received for one word and emit (word, total).
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

Example Word Count (Driver)

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // the reducer doubles as the combiner
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

An Optimization: The Combiner
- A combiner is a local aggregation function for repeated keys produced by the same map
- For associative ops. like sum, count, max
- Decreases the size of intermediate data
- Example: local counting for Word Count:

    def combiner(key, values):
        output(key, sum(values))
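To see what the combiner buys, the sketch below runs the map step on the three example splits and counts how many intermediate pairs would cross the network with and without local aggregation. It is plain Python rather than Hadoop code; map_words and combine are hypothetical helper names introduced for this illustration.

```python
from collections import Counter


def map_words(text):
    """Map step for one input split: one (word, 1) pair per word."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Combiner: sum the values per key within a single mapper's output."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())


splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

raw = [map_words(s) for s in splits]       # what each mapper emits
combined = [combine(p) for p in raw]       # what each mapper sends after combining

pairs_without = sum(len(p) for p in raw)
pairs_with = sum(len(p) for p in combined)

# The reducer still gets the right totals, because addition is associative.
totals = Counter()
for pairs in combined:
    for key, value in pairs:
        totals[key] += value

print(pairs_without, pairs_with, totals["the"])  # prints: 13 12 3
```

On this tiny input only the second split has a repeated key ("the"), so the saving is a single pair; on real text, where common words repeat many times per split, the reduction in intermediate data is far larger.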

Word Count with Combiner
[Figure: Input -> Map & Combine -> Shuffle & Sort -> Reduce -> Output. For the input "the fox ate the mouse" the combiner emits (the, 2) instead of two separate (the, 1) pairs. Output: brown, 2; fox, 2; the, 3; ...]

[Figure: MapReduce execution overview. (1) The user program submits the job to the Master; (2) the Master schedules map and reduce tasks on workers; (3) map workers read the input splits (split 0 ... split 4); (4) map workers write intermediate files to local disk; (5) reduce workers read them remotely; (6) reduce workers write output file 0 and output file 1. Stages: input files, Map phase, intermediate files (on local disk), Reduce phase, output files. Adapted from (Dean and Ghemawat, OSDI 2004).]

Distributed File System
- Don't move data to workers: move workers to the data!
  - Store data on the local disks of nodes in the cluster
  - Start up the workers on the node that has the data local
- Why?
  - Not enough RAM to hold all the data in memory
  - Disk access is slow, but disk throughput is reasonable (i.e. sequential reading of disk for stream processing)
- A distributed file system is the answer
  - GFS (Google File System) for Google's MapReduce
  - HDFS (Hadoop Distributed File System) for Hadoop

Another example of MapReduce
- Clickstream-like data: for each ad viewing, user info and whether they clicked on the ad: {userid, ip, zip, adnum, clicked}
- Want unique users who saw, clicked, by zip

First try
- Use zip as the key: Map can emit {90210, {0,1}} if the user saw and failed to click, {90210, {1,1}} if the user saw and clicked
- Reduce receives, say: {90210, [{1,1}, {0,1}, {0,1}]}
- This shows three visits and one click, but we don't know if these visits were by different users, so we don't know the number of unique users

Second try
- We need to preserve user identity longer
- Use {zip, userid} as the key
- Value: again {0,1}, or {1,1} if saw and clicked
- Map emits {{90210,user123}, {0,1}}, etc.
- Reducer gets info on one user, one zip: {{90210,user123}, [{0,1}, {1,1}]}
- Reducer can process the list and emit {{90210,user123}, {1,2}}
- But not done yet
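The "second try" above, as a first pass, might look like the following in plain Python. This is a sketch under assumptions: the helper run_pass stands in for the framework's shuffle and sort, and the sample records (IP addresses, ad numbers, the second user) are invented for illustration. Values follow the slides' {clicked, seen} convention.

```python
from collections import defaultdict


def map_pass1(record):
    """Key by (zip, userid) so that user identity survives the shuffle."""
    userid, ip, zip_code, adnum, clicked = record
    yield (zip_code, userid), (clicked, 1)  # {0,1} = saw, {1,1} = saw and clicked


def reduce_pass1(key, values):
    """Collapse one user's views in one zip into (total clicks, total views)."""
    clicks = sum(c for c, _ in values)
    views = sum(v for _, v in values)
    yield key, (clicks, views)


def run_pass(records, mapper, reducer):
    """Hypothetical single-process stand-in for shuffle, sort, and reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Invented sample rows: {userid, ip, zip, adnum, clicked}
table = [
    ("user123", "10.0.0.1", "90210", 7, 0),
    ("user123", "10.0.0.1", "90210", 9, 1),
    ("user456", "10.0.0.2", "90210", 7, 0),
]
pass1 = dict(run_pass(table, map_pass1, reduce_pass1))
print(pass1[("90210", "user123")])  # prints: (1, 2)
```

The output has one record per user per zip, which is why another pass is still needed to roll users up into per-zip totals.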

Second pass
- Reducer (pass 1) emits {{90210,user123}, {1,2}}
- Second Map reads this and emits its contribution to the zip's stats (one user saw and clicked): {90210, {1,1}}
- Second Reduce receives 2210 user reports for this zip: {90210, [{1,1}, {0,1}, {0,1}, ...]}
- It counts up unique users and their clicks, and emits {90210, {56, 2210}}: 2210 unique users in the zip viewed ads, and 56 of them clicked

Compare to SQL
- Table T of {userid, ip, zip, adnum, clicked}
- Using a trick, we can do this in one select:
    select zip, count(distinct userid), count(distinct nullif(clicked*userid, 0))
    from T group by zip
- Assumes clicked = 0 or 1 in each T row, and a numeric, nonzero userid
- The nullif is needed because every non-click row gives clicked*userid = 0; a bare count(distinct clicked*userid) would count that 0 as one extra user
- Note that DB2, Oracle, and mysql can do count(distinct expr), though Entry SQL92 only requires count(distinct column)

Compare to SQL: closer to MapReduce processing
- Table T of {userid, ip, zip, adnum, clicked}
- First step:
    select zip, userid, sum(clicked) cc from T group by zip, userid
  (sum, not count: count(clicked) would count every row, clicked or not)
- Put the results into table T1 (zip, userid, cc)
- Second step:
    select zip, count(*), sum(sign(cc)) from T1 group by zip
- The scalar function sign(x) = -1, 0, +1 is available on Oracle, DB2, and mysql, but not in Entry SQL92

Do it in SQL92?
- CASE is the conditional value capability in SQL92, but is not required for Entry SQL92 (it is supported by all respectable DBs)
- sign(x) as case:
    case when x < 0 then -1 when x > 0 then 1 else 0 end

Something better?
- We see that using MapReduce means telling the system in detail how to solve the problem
- SQL just states the problem and lets the QP (query processor) figure out how to do it
- Next time: Hive, the SQL-like query language built on top of MapReduce
- Spark also has a SQL-like language
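The second pass and its SQL counterpart can be checked against each other on a small example. The sketch below is plain Python with the standard sqlite3 module, not Hadoop or Hive; the pass-1 records are invented sample data, and the query spells sign(cc) as a CASE expression (as in the "Do it in SQL92?" slide) because sign() is not available in every SQLite build.

```python
import sqlite3
from collections import defaultdict

# Invented pass-1 output: ((zip, userid), (clicks, views)), one record per user per zip.
pass1_output = [
    (("90210", "user123"), (1, 2)),
    (("90210", "user456"), (0, 1)),
    (("90210", "user789"), (3, 3)),
    (("02139", "user001"), (0, 4)),
]


def map_pass2(key, value):
    """Drop the userid; each user contributes (1 if they ever clicked else 0, 1)."""
    zip_code, _userid = key
    clicks, _views = value
    yield zip_code, (1 if clicks > 0 else 0, 1)


def reduce_pass2(zip_code, values):
    """Per zip: (number of users who clicked, number of unique users)."""
    clickers = sum(c for c, _ in values)
    users = sum(u for _, u in values)
    return zip_code, (clickers, users)


groups = defaultdict(list)
for key, value in pass1_output:
    for zip_code, contribution in map_pass2(key, value):
        groups[zip_code].append(contribution)
mr_result = dict(reduce_pass2(z, vals) for z, vals in groups.items())

# Same computation in SQL over table T1 (zip, userid, cc).
conn = sqlite3.connect(":memory:")
conn.execute("create table T1 (zip text, userid text, cc integer)")
conn.executemany(
    "insert into T1 values (?, ?, ?)",
    [(z, u, clicks) for (z, u), (clicks, _views) in pass1_output],
)
rows = conn.execute(
    "select zip, count(*), sum(case when cc > 0 then 1 else 0 end) "
    "from T1 group by zip"
).fetchall()
sql_result = {z: (clickers, users) for z, users, clickers in rows}
conn.close()

print(mr_result["90210"], sql_result["90210"])  # prints: (2, 3) (2, 3)
```

The MapReduce version spells out the grouping and counting by hand, while the SQL version states the result wanted and leaves the plan to the query processor, which is the contrast the last slide draws.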