Intro Systems. Big Data Pot Pourri. Olivier Curé. Université Paris-Est Marne la Vallée, LIGM UMR CNRS 8049, France.

1 Big Data Pot Pourri Université Paris-Est Marne la Vallée, LIGM UMR CNRS 8049, France January 7, 2014

2 Happy birthday, Pierre

5 Wikipedia definition: Big data usually includes data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.

6 Meaning of "big" evolves. 2008: Google processes 20 PB a day. 2009: Facebook has 2.5 PB of user data + 15 TB/day. 2009: eBay has 6.5 PB of user data + 50 TB/day. 2011: Yahoo! has PB of data. 2012: Facebook ingests 500 TB/day.

9 3 Vs of big data. Big Velocity: ranging from batch processing to real time. Use cases: electronic trading, real-time ad placement on the Web, mobile social networking, etc. Big Variety: ranging from structured to unstructured. Data sources: xls, html, xml, rdbms, rdf, csv, etc. Tools: ETL (Extract, Transform, Load). Big Volume: ranging from TB to PB and more.

10 More Vs. Veracity: data quality; many companies are starting to address this area (Trifacta, Data Tamer). Vocabulary: semantics, ontologies. Venue: location.

12 Hype cycle 2013

14 Why big data? Increase of storage capacities. Increase of processing power. Availability of data. The Web.

15 Storage capacity

16 Computation capacity. $5 million vs $400: the price of the fastest supercomputer in 1975 and of an iPhone 4 with equal performance.

17 Data availability

18 Type of available data

19 Data available from Internet of Things

20 Growth of IoT

21 Lack of talent for big data

22 2 forms of Big volumes. Small analytics: SQL on very large data sets, using the aggregate operators of SQL. Big analytics: data clustering, regressions, machine learning, using statistical tools: R, SPSS (Statistical Product and Service Solutions, IBM), SAS.
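As a rough illustration of "small analytics", the sketch below runs SQL aggregate operators (COUNT, SUM with GROUP BY) over a toy table. The table name, columns and sample rows are assumptions made for the example, and an in-memory SQLite database stands in for a system that would really hold a very large data set.

```python
import sqlite3


def revenue_by_country(rows):
    """Return (country, number of sales, total amount) using SQL aggregates."""
    conn = sqlite3.connect(":memory:")  # stand-in for a large-scale SQL engine
    conn.execute("CREATE TABLE sales (country TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT country, COUNT(*), SUM(amount) FROM sales "
        "GROUP BY country ORDER BY country"
    )
    result = cursor.fetchall()
    conn.close()
    return result


# Hypothetical sample data: (country, amount) pairs.
sample = [("FR", 10.0), ("US", 5.0), ("FR", 2.5), ("US", 1.0), ("DE", 4.0)]
totals = revenue_by_country(sample)
```

The query itself would look the same at scale; what changes is the engine that executes it.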

23 Making sense at scale. Machines: cloud computing. Algorithms: machine learning and analytics. People: crowdsourcing and human computation.

24 Crowdsourcing and human computation. Crowdsourcing, a term coined in 2006 in Wired magazine: the act of taking a job traditionally performed by a designated agent and outsourcing it to an undefined, generally large group of people in the form of an open call. Human computation [2]: ... a paradigm for utilizing human processing power to solve problems that computers cannot yet solve. Global Brain [3]: people and computers together constitute a global brain, which calls for new programming metaphors to program it. [2] von Ahn. [3] Bernstein, CACM12.

25 Machine Learning (aka data mining, predictive analytics). Machine learning systems automatically learn programs from data. Different types of ML: supervised learning (e.g. decision trees, rules, Bayesian techniques, neural networks, SVM) and unsupervised learning (e.g. clustering, dimensionality reduction). Use cases: spam filtering, clickstream mining, recommendation, etc.
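To make "learning a program from data" concrete, here is a minimal supervised-learning sketch: a one-level decision tree (a decision stump) that learns a single threshold from labeled examples. The spam scenario, the feature (a count of suspicious words) and the sample values are all assumptions for illustration, not taken from the slides.

```python
def train_stump(samples):
    """Learn a threshold from (feature, label) pairs, label being 0 or 1.

    Every observed feature value is tried as a threshold; the one that
    misclassifies the fewest training samples is kept.
    """
    best_threshold, best_errors = None, len(samples) + 1
    for threshold, _ in samples:
        errors = sum((x >= threshold) != bool(y) for x, y in samples)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def predict(threshold, x):
    """Classify as 1 (spam) when the feature reaches the learned threshold."""
    return int(x >= threshold)


# Hypothetical training set: (suspicious word count, is_spam).
training = [(0, 0), (1, 0), (2, 0), (5, 1), (7, 1), (9, 1)]
threshold = train_stump(training)
```

The "program" produced here is just the rule `x >= threshold`; real systems learn far richer models, but the principle of deriving the rule from labeled data is the same.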

26 Machines. Two main solutions to process big data: MapReduce (MR) and high-compression approaches. Using them together will be more and more frequent in future systems, for instance by distributing compressed data over a cluster of machines.
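The MR approach can be sketched on a single machine as three phases: map, shuffle (group by key) and reduce. The function names below are assumptions for this example; a framework such as Hadoop would run the map and reduce functions in parallel across a cluster and perform the shuffle over the network.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big volumes of data"])
```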

27 Related tools

28 Tools using Hadoop. Hive: data warehouse infrastructure that provides data summarization and ad hoc querying (HiveQL). Pig: high-level data-flow language and execution framework for parallel computation (Pig Latin). Mahout: scalable machine learning and data mining library. Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Many more: Cascading, Cascalog, mrjob, MapR, Azkaban, Oozie, ...

29 Dremel [4]. Scalable, interactive, ad hoc query system for the analysis of read-only nested data. Combines multi-level execution trees and a columnar data layout. Operates on in situ nested data. Provides a high-level, SQL-like language to express queries. Executes queries natively without translating them to MR. Not a replacement for MR: used in conjunction with MR to analyze the outputs of MR pipelines and to prototype larger computations. Data model is ColumnIO. Uses a columnar storage approach. [4] VLDB 2010.
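The benefit of a columnar layout can be sketched with flat records: once values are stored per column, an aggregate over one field only touches that column. This is a simplification under stated assumptions: the record fields are invented, and the sketch ignores the nested and repeated fields that Dremel's actual format handles.

```python
def to_columns(records, fields):
    """Turn row-oriented records into one list of values per field."""
    return {field: [record[field] for record in records] for field in fields}


# Hypothetical flat log records (row-oriented).
records = [
    {"url": "/a", "country": "FR", "latency_ms": 120},
    {"url": "/b", "country": "US", "latency_ms": 80},
    {"url": "/a", "country": "FR", "latency_ms": 100},
]

columns = to_columns(records, ["url", "country", "latency_ms"])

# An aggregate over one field reads a single column, not whole records.
latencies = columns["latency_ms"]
average_latency = sum(latencies) / len(latencies)
```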

30 Tenzing [5]. SQL query execution built on top of MapReduce. SQL features: projection, filtering, aggregation, joins, OLAP extensions (cube, rollup), set operations, nested queries, views, analytic functions (rank, sum, min, max). Optimizations and indexes to speed up query execution. Not adapted to nested-repeated structures. Provides better built-in optimization than Pig Latin and Sawzall. FlumeJava and DryadLINQ are geared toward programmers, for simple ETL tasks. Hive, Scope, and Hadapt are competitors of Tenzing. Greenplum, Teradata, ParAccel, and Vertica are parallel DBs that embed MR. [5] VLDB 2011.

31
             Sawzall  Tenzing  Dremel
Latency      High     Medium   Low
Scalability  High     High     Medium
SQL          None     High     Medium
Power        High     Medium   Low

32 Google's Megastore [6]. Storage system that mixes the scalability of NoSQL and the convenience of an RDBMS. Strong consistency, high availability. ACID semantics within fine-grained partitions of data. Synchronously replicated DB across data centers. Layered on top of BigTable, hence high communication costs. [6] CIDR 2011.

33 Google's Spanner [7]. Scalable, globally distributed, synchronously replicated DB across data centers. Higher performance than Megastore. [7] OSDI 2012.

34 Facebook. The company's largest cluster is more than 100 PB. Hive queries/day. The data warehouse has grown 2,500x in 4 years. That forced them to design a better way to process big data at web scale: Corona.

35 Facebook's Corona. A new system for scheduling Hadoop jobs that makes better use of a cluster's resources and also makes it more amenable to multitenant environments. Hadoop's JobTracker node is responsible for both cluster management and job scheduling, and is therefore slow. Hadoop's job scheduling involves an inherent delay, which is problematic for small jobs that require fast execution. Corona creates an individual job tracker for each job, plus a cluster manager that tracks nodes and available resources. Apache YARN and Apache Mesos (used at Twitter) are Corona's competitors.

36 Netflix in numbers (2012). An American provider of on-demand Internet streaming media. Netflix consumes 32.7 percent of the Internet's peak downstream traffic in North America. 25 million users. 30 million plays per day (tracks each user's rewind, forward, pause). More than 2 billion hours of streaming video for the last quarter of [...]. [...] million ratings/day. 3 million searches/day. Geo-location data, device info, social media data from Facebook and Twitter.

37 Netflix's recommendation system. 75% of users' selections are based on the company's recommendations. Main goal: predict what you will watch next and propose it if available on Netflix. Ultimate goal: predict what customers will view to completion. Aims to consider volume, colors, scenery configurations.

39 Netflix. Considered the king of computing in the cloud. Runs almost entirely on the AWS platform. Uses Hadoop (Elastic MapReduce, Amazon's MR on AWS) as storage and processing for almost everything. Genie is Netflix's homemade Hadoop Platform as a Service (PaaS): jobs are submitted via a REST API. Uses both S3 and HDFS as the storage layer: S3 to share the same data among clusters, HDFS to speed up the computation process.

40 Twitter. Over 140M active users. Over 400M visitors. 400M tweets/day (peak 25K/sec). Types of data: text, social graph, time series, interest graph. What do they do with data: search, recommendations, ads, anti-spam.

41 Twitter (2). When a user writes a tweet, the tweet enters the WriteAPI, which calls the Fanout module to send it to all followers, i.e. it is stored in each follower's array of tweets (in Redis). The Redis cluster stores all users' timelines (not persisted, everything in RAM, replicated 3 times). In case of failover, a timeline can be reconstructed. They keep the last 800 tweets for each user in RAM. Fanout asks the Social Graph service to know who is following whom. In Redis, the data model is tweetid (8 bytes), UserID (8 bytes), bits (4 bytes), plus retweet (tweetid). The Timeline service locates the Redis server where your home timeline is stored.
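The fanout-on-write path described above can be sketched as follows. This is a simplified single-process model, not Twitter's code: the class and function names are hypothetical, a dictionary of bounded deques stands in for the Redis cluster, and a plain mapping stands in for the Social Graph service.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 800  # the slide states the last 800 tweets are kept per user


class TimelineStore:
    """In-memory stand-in for the Redis cluster holding home timelines."""

    def __init__(self, limit=TIMELINE_LIMIT):
        # A bounded deque drops the oldest entry once the limit is reached.
        self.timelines = defaultdict(lambda: deque(maxlen=limit))

    def push(self, user_id, tweet_id, author_id):
        self.timelines[user_id].appendleft((tweet_id, author_id))

    def home_timeline(self, user_id):
        return list(self.timelines[user_id])


def fanout(store, followers_of, author_id, tweet_id):
    """Write the tweet into the timeline of every follower of the author."""
    for follower_id in followers_of.get(author_id, ()):
        store.push(follower_id, tweet_id, author_id)


# Hypothetical social graph: author id -> follower ids.
followers_of = {1: [2, 3], 2: [3]}
store = TimelineStore(limit=2)  # tiny limit to make the eviction visible
fanout(store, followers_of, author_id=1, tweet_id=100)
fanout(store, followers_of, author_id=2, tweet_id=101)
fanout(store, followers_of, author_id=1, tweet_id=102)
```

The design trades write cost for read speed: a tweet is copied once per follower, so reading a home timeline is a single lookup.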

42 Twitter (3). The WriteAPI also sends tweets to the Search Ingester, which stores them in a modified Lucene index (named Earlybird). The index is in memory. Blender is the service that gives access to Earlybird. Twitter also has a push solution (pushes tweets to users): the WriteAPI sends tweets to HTTP Push, which contains Hosebird, which works out how to deliver each tweet. A similar service exists for mobile devices, named Mobile Push. The WriteAPI also sends all tweets to HDFS to run MR jobs.
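The idea of an in-memory search index can be sketched with a basic inverted index mapping each term to the tweets that contain it. This is only an illustration of the data structure: the class name and tokenization are assumptions, and it does not reflect how Earlybird or Lucene are actually implemented.

```python
from collections import defaultdict


class InMemoryIndex:
    """Toy inverted index: term -> list of tweet ids containing that term."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, tweet_id, text):
        # Naive tokenization: lowercase and split on whitespace.
        for term in set(text.lower().split()):
            self.postings[term].append(tweet_id)

    def search(self, query):
        """Return ids of tweets containing every query term, highest id first."""
        terms = query.lower().split()
        if not terms:
            return []
        matches = set(self.postings.get(terms[0], ()))
        for term in terms[1:]:
            matches &= set(self.postings.get(term, ()))
        return sorted(matches, reverse=True)


index = InMemoryIndex()
index.add(1, "Big data at scale")
index.add(2, "Graph processing at scale")
index.add(3, "big graph")
```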

43 Spark. Fast, MR-like engine. In-memory storage for fast iterative queries (in Resilient Distributed Datasets, RDDs) vs disk in MR. Not restricted to Map and Reduce: also has sample, join, and group-by operations. Up to 100x faster than Hadoop (2-10x for on-disk data). Compatible with Hadoop's storage APIs: access to HDFS, HBase, S3, SequenceFiles.

44 Spark: Resilient Distributed Datasets (RDDs). A collection of Java objects. Can be partitioned/distributed and shuffled/distributed across a cluster. Need not be in memory all at once. At the moment, RDDs expire at the end of a job.

45 Shark. Port of Apache Hive to run on Spark. Compatible with Hive data and queries (HiveQL, UDFs). Up to 100x faster. Who uses Spark/Shark: Yahoo!, Foursquare, Airbnb, etc.

46 Graph processing. Hadoop is great for many applications, but not for everything. Graph processing is better handled by systems like Google Pregel and Apache Giraph, or by iterative models (MPI).

47 Bulk Synchronous Parallel (BSP) model [9]. An abstract computer for designing parallel algorithms. A BSP computer is a set of connected processors. Each processor has local memory and may follow different threads of computation. A BSP computation proceeds in a series of global supersteps. A superstep has 3 components. Concurrent computation on every participating processor: each process uses values stored in its local memory, and computations execute asynchronously of each other. Communication: processes exchange data via one-sided put and get calls rather than two-sided send and receive calls. Barrier synchronization: a point where a process waits for all other processes to finish their communication actions; it concludes a superstep. Computation and communication actions are not necessarily ordered in time. [9] L. Valiant, CACM 1990.

48 Bulk Synchronous Parallel (BSP) model (2). Processes are randomly assigned to processors. The problem to solve is split into more logical processes than there are physical processors. One-sided communication prevents deadlocks (no circular dependencies) and permits fault tolerance. Apache Hama is a pure BSP computing framework on top of HDFS. Pregel and Giraph both follow this model.

49 Pregel and Giraph. Programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own state and that of its outgoing edges, and mutate the graph's topology.
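The vertex-centric iteration described above can be sketched with a classic toy problem: propagating the maximum value through a graph. This is a single-process simulation under stated assumptions; the function name and the graph are invented, and real Pregel or Giraph programs define a per-vertex compute method that the framework runs in parallel, with a barrier between supersteps.

```python
def run_max_propagation(edges, values):
    """Propagate the largest vertex value to all vertices, superstep by superstep.

    edges: vertex -> list of target vertices (outgoing edges).
    values: vertex -> initial value.
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value along its outgoing edges.
    inbox = {vertex: [] for vertex in values}
    for vertex, value in values.items():
        for target in edges.get(vertex, ()):
            inbox[target].append(value)
    supersteps = 1

    # Keep going while at least one message is waiting to be delivered.
    while any(inbox.values()):
        outbox = {vertex: [] for vertex in values}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # no incoming message: the vertex stays inactive
            best = max(messages)
            if best > values[vertex]:
                # The vertex updates its state and notifies its neighbors.
                values[vertex] = best
                for target in edges.get(vertex, ()):
                    outbox[target].append(best)
        inbox = outbox  # barrier: messages become visible in the next superstep
        supersteps += 1
    return values, supersteps


# Hypothetical chain a <-> b <-> c, with the maximum at one end.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
final_values, steps = run_max_propagation(graph, {"a": 1, "b": 2, "c": 5})
```

Each vertex only reads messages from the previous superstep, which is what lets the per-vertex work run in parallel without locks.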

50 HaLoop [10]. A modified version of the Hadoop MR framework to support iterative programs. The task scheduler is loop-aware. Adds various caching mechanisms. [10] VLDB 2010.

51 Message Passing Interface (MPI). A language-independent communications protocol used to program parallel computers. Point-to-point and collective communication are supported. Goals: high performance, scalability, and portability.

52 Polyglot persistence. Term coined after Neal Ford's polyglot programming, which advocates writing programs with a mix of programming languages. Polyglot persistence aims to use different data stores in your applications. Imagine an e-commerce application. What would you use for the shopping cart, the completed orders, and the session data?

53 The shopping cart and the session data can be efficiently stored in a key-value store. Their keys are, respectively, userid and sessionid. Once an order is completed, that data can be stored in an RDBMS or a document store. What if we want to add a product recommendation service? Think collaborative filtering: those who bought that product also liked this product, or your friends bought it. What about inventory and item prices?

54 A graph database is well suited to storing recommendation data. Inventory and item prices fit nicely in an RDBMS. If we have a lot of text, we can index that text using a store like Solr (part of the Lucene project). With polyglot persistence, one has to be careful with deployment complexity: all databases are needed in production at the same time. It may be a good solution to design services on top of these databases. It reduces the impact of data storage choices.
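The suggestion to hide each store behind a service can be sketched as below. Everything here is a hypothetical stand-in: the class and method names are invented, and plain Python containers play the role of the key-value store, the RDBMS or document store, and the graph database. The point is only that callers depend on the service's methods, not on which store sits behind them.

```python
from collections import defaultdict


class ShopService:
    """Service facade over several stores, each replaced by a toy container."""

    def __init__(self):
        self.carts = defaultdict(list)       # key-value store, key = user id
        self.orders = []                     # RDBMS or document store
        self.prices = {}                     # RDBMS (inventory and prices)
        self.bought_with = defaultdict(set)  # graph store: product -> products

    def set_price(self, product, price):
        self.prices[product] = price

    def add_to_cart(self, user_id, product):
        self.carts[user_id].append(product)

    def checkout(self, user_id):
        """Turn the cart into a completed order and record co-purchases."""
        items = self.carts.pop(user_id, [])
        total = sum(self.prices[item] for item in items)
        order = {"user": user_id, "items": items, "total": total}
        self.orders.append(order)
        for item in items:
            self.bought_with[item].update(i for i in items if i != item)
        return order

    def recommend(self, product):
        """Those who bought this product also bought these products."""
        return sorted(self.bought_with.get(product, ()))


shop = ShopService()
for name, price in [("book", 20), ("lamp", 35), ("pen", 2)]:
    shop.set_price(name, price)
shop.add_to_cart(1, "book")
shop.add_to_cart(1, "lamp")
first_order = shop.checkout(1)
shop.add_to_cart(2, "book")
shop.add_to_cart(2, "pen")
second_order = shop.checkout(2)
```

Swapping one of the containers for a real database would then change the inside of one method group without touching the callers.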