Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo.



1 Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo.com

2 Hadoop
Hadoop is an open source distributed platform for data storage and computation that runs on commodity hardware.
(Adapted from the slides of Donald Miner)

3 HDFS
- Works on top of a native file system (for example ext3 or xfs)
- Data is organized into files and directories
- Files are divided into blocks (64-128 MB)
- Files are distributed across cluster nodes
- Files are write-once
- The location of blocks can be used to optimize Map/Reduce execution
- Blocks are replicated for fault tolerance
- Data integrity is ensured via checksums
- HDFS is not good for random reads
- HDFS is optimized for streaming reads of files
- HDFS is based on the design of the Google File System
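The block and replication bullets above can be illustrated with a toy Python sketch. It is not how HDFS actually places data (real placement is rack-aware and handled by the NameNode); the function names, the replication factor of 3 and the round-robin policy are assumptions made only for illustration.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes (simple round-robin).

    Hypothetical policy: it only guarantees that the copies of one block
    land on different nodes, which is the property fault tolerance needs.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(200, block_size_mb=64)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
for block, holders in placement.items():
    print(f"block {block} ({blocks[block]} MB) -> {holders}")
```

Losing any single node still leaves two copies of every block, and a scheduler that knows `placement` can run a mapper on a node that already holds the block it reads.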

4 Map/Reduce Paradigm
- Jobs are described in terms of Mappers and Reducers
- Mappers receive input records and emit key/value pairs
- Pairs from mappers are automatically
  - grouped by the key
  - sorted for each reducer
- Reducers get key/value pairs and emit the key/value result(s)
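The paradigm above can be simulated in a few lines of single-process Python. This is only a sketch of the data flow, not the Hadoop API: `run_mapreduce`, the sample records and the two callbacks are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run mapper over every record, group by key, sort keys, then reduce."""
    groups = defaultdict(list)
    for record in records:                      # map phase
        for key, value in mapper(record):
            groups[key].append(value)           # grouping by key
    results = []
    for key in sorted(groups):                  # keys reach reducers sorted
        results.extend(reducer(key, groups[key]))
    return results


def parse_visit(line):
    """Mapper: 'user pages' -> (user, pages)."""
    user, pages = line.split()
    yield user, int(pages)


def total_pages(user, page_counts):
    """Reducer: sum all values seen for one key."""
    yield user, sum(page_counts)


totals = run_mapreduce(["alice 3", "bob 5", "alice 4"], parse_visit, total_pages)
print(totals)
```

In a real cluster the map calls run on many machines and the grouping step is a network shuffle, but the contract between mapper and reducer is the same.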

5 Example 1: word count

6 Mapper class

7 Reducer class

8 [Figure: word count data flow]
- Distribute the documents among K computers (sample text: "To be or not to be.")
- Map: for each doc, return a set of (word, frequency) pairs: (to,1), (to,1), (be,1), (be,1), (or,1), ...
- Grouped pairs sent to the reducers: (to,1,1,...), (be,1,1,...), (come,1,1,1,...)
- Reduce: count the occurrences of each word: To: 180, Be: 251, Come: 123
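The slides' Mapper and Reducer classes are not reproduced in this transcript, so the following is a stand-in Python sketch of the same word count flow rather than the original Java. The function names, the tokenizing regex and the second sample sentence are assumptions.

```python
import re
from collections import defaultdict


def map_words(document):
    """Mapper: emit (word, 1) for every word in one document."""
    for word in re.findall(r"[a-z']+", document.lower()):
        yield word, 1


def reduce_counts(word, ones):
    """Reducer: add up the 1s emitted for a single word."""
    return word, sum(ones)


def word_count(documents):
    grouped = defaultdict(list)
    for doc in documents:               # each doc could live on another machine
        for word, one in map_words(doc):
            grouped[word].append(one)   # (to, [1, 1, ...]), (be, [1, 1, ...])
    return dict(reduce_counts(w, ones) for w, ones in sorted(grouped.items()))


counts = word_count(["To be or not to be.", "To come or not to come."])
print(counts)
```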

9 Example 2: inner join (from "MapReduce Design Patterns" by Donald Miner and Adam Shook)

10 Mapper class: users records

11 Mapper class: comments records

13 Reducer: The actual join logic
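The code for these join slides is missing from the transcript. As a rough sketch of a reduce-side inner join in the same spirit (not the book's Java code), each mapper tags its records with their source and keys them by user id, and the reducer pairs every user record with every comment record that shares the key. The tags "U"/"C", the field names and the sample rows are assumptions.

```python
from collections import defaultdict
from itertools import product


def map_user(user):
    """Mapper for users records: key by userid, tag with 'U'."""
    yield user["userid"], ("U", user)


def map_comment(comment):
    """Mapper for comments records: key by userid, tag with 'C'."""
    yield comment["userid"], ("C", comment)


def reduce_inner_join(userid, tagged_records):
    """Reducer: split by tag, then emit the cross product (inner join)."""
    users = [rec for tag, rec in tagged_records if tag == "U"]
    comments = [rec for tag, rec in tagged_records if tag == "C"]
    for user, comment in product(users, comments):
        yield {**user, **comment}


def inner_join(users, comments):
    grouped = defaultdict(list)
    for record, mapper in [(u, map_user) for u in users] + \
                          [(c, map_comment) for c in comments]:
        for key, value in mapper(record):
            grouped[key].append(value)
    joined = []
    for key in sorted(grouped):
        joined.extend(reduce_inner_join(key, grouped[key]))
    return joined


users = [{"userid": 1, "name": "ann"}, {"userid": 2, "name": "bob"}]
comments = [{"userid": 1, "text": "hi"}, {"userid": 1, "text": "bye"},
            {"userid": 3, "text": "orphan"}]
joined = inner_join(users, comments)
print(joined)
```

Keys that have records on only one side (user 2, comment author 3) produce an empty cross product, which is exactly what makes the join an inner one.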

14 Cool things about Hadoop
- No schema imposed: decide what you want when loading
- Keep the full original data!
- Store anything: media, text, logs
- Transparent parallelism and network programming
- Fault tolerance
  - Blocks are replicated
  - Only active nodes get assigned to jobs
  - MapReduce handles slow mappers: a duplicate of a slow-running mapper is launched automatically, and the result of whichever copy finishes first is used
- Scalability
Check it out

15 Hadoop ecosystem
- Higher-level languages like Pig and Hive
- Cascalog

16 Pig
- Pig is a SQL-like query language that computes using MapReduce jobs
- It is higher-level than Map/Reduce: FOREACH, GROUP BY, JOIN, DISTINCT, FILTER, etc.
- Custom loaders and storage functions
- Reads both structured and unstructured data
- It is a data flow language

17 Why use Pig
- Easier to adopt for non-Java programmers
- Runs without a compilation step
- Faster to write (not necessarily faster to execute)
- Word count example:
    A = load './input.txt';
    B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
    C = group B by word;
    D = foreach C generate COUNT(B), group;
    store D into './wordcount';
- Join example:
    A = JOIN comments BY userid, users BY userid;
- Built-in functions: count, group by, joins, filter
- Built-in optimizations of execution
- Can still use Map/Reduce from Pig (use the mapreduce keyword)
- Very good for quick analytics

18 Pig drawbacks
- Might be clumsy to write tests for (but usually you don't need tests for one-off analytics)
- But cool for development: use Hawk!
- You can't do everything (for example, ifs)
- Pig is not good for
  - Advanced string manipulations (can use UDFs)
  - Complex joins
  - Math
  - Complex aggregates
  - Iterative algorithms
- But the majority of these can be addressed with UDFs
- Hard to reuse code (macros have limited functionality)

19 Pig UDF
    REGISTER mylibrary.jar;
    DEFINE ToUpperCase com.mine.pig.udf.ToUpperCase();
    A = LOAD 'words_data' AS (word: chararray, position: int);
    B = FOREACH A GENERATE ToUpperCase(word);

20 Cascalog
- Cascalog: a compiler that produces sequences of Map-Reduce programs
- Clojure-based (functional programming language)
- Compiles to Java byte code, so it can directly access all your Java-based code
- Granular testing and mocking
- Runs directly on Hadoop and EMR
- Wide variety of built-in functionality
  - Inner and outer joins
  - Aggregators
  - Functions
  - Subqueries
  - Sorting
- High performance
Check it out

21 Example 1: clojure * 3 5 Examples Check it out

22 More examples
- Inner join
    user=> (?<- (stdout) [?person ?age ?gender] (age ?person ?age) (gender ?person ?gender))
- Full outer join
    user=> (?<- (stdout) [?person !!age !!gender] (age ?person !!age) (gender ?person !!gender))
- Count of followers
    user=> (?<- (stdout) [?person ?count] (person ?person) (follows ?person !!follower) (c/!count !!follower :> ?count))
- The numbers that equal their squares
    user=> (?<- (stdout) [?n] (integer ?n) (* ?n ?n :> ?n))
  Cascalog detects that we are trying to rebind the ?n variable and will automatically filter out tuples where the output of the * predicate is not equal to the input.
Check it out

23 What's hot in the Big Data arena in New York
- Etsy
- Foursquare
- Spotify
- Knewton
- IntentMedia

24 Etsy's Skyline
- Etsy: the world's largest handmade and vintage marketplace
- Practices continuous deployment (30-60 deploys per day)
- Optimized for recovering from failure, rather than avoiding it
- Some 250K metrics are emitted and routed to Skyline, the failure-detection software
- Near real time: approx. 70 seconds of lag
- Runs on a 150-node Hadoop cluster
Check it out

25 Skyline: continued
- Anomalies are detected through a consensus model
- A metric is anomalous if its latest value is over 3 s.d. above its moving average (statistical process control)
- By histogram
- By linear regression (distribution of residuals)
- Exponentially weighted moving averages (time series with decay factor)
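The consensus idea above can be sketched in Python as a vote among independent detectors. This is not Skyline's code: it implements only two of the listed checks (the 3 s.d. rule against a plain moving average and against an exponentially weighted one), and the window size, the decay factor `alpha` and the vote threshold are assumed values.

```python
from statistics import mean, pstdev


def three_sigma(series, window=10):
    """Latest value is more than 3 s.d. above the moving average before it."""
    history = series[-window - 1:-1]
    if len(history) < 2:
        return False
    sd = pstdev(history)
    return sd > 0 and series[-1] > mean(history) + 3 * sd


def ewma_three_sigma(series, alpha=0.3):
    """Same test against an exponentially weighted mean and variance."""
    if len(series) < 3:
        return False
    avg, var = series[0], 0.0
    for x in series[1:-1]:
        diff = x - avg
        avg += alpha * diff                       # decayed running mean
        var = (1 - alpha) * (var + alpha * diff * diff)
    return var > 0 and series[-1] > avg + 3 * var ** 0.5


def is_anomalous(series, detectors=(three_sigma, ewma_three_sigma), consensus=2):
    """Flag the metric only if at least `consensus` detectors agree."""
    votes = sum(1 for detect in detectors if detect(series))
    return votes >= consensus


steady = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 11]
spike = steady[:-1] + [40]
print(is_anomalous(steady), is_anomalous(spike))
```

Requiring agreement between detectors trades sensitivity for fewer false alarms; a single earlier spike inflates both the mean and the variance, which hints at the "spike influence" problem listed on the next slide.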

26 Skyline: continued
Problems
- Seasonality
- Spike influence (raises the moving average)
- Normality
- Parameters
- As of now, generates too much noise

27 Spotify
- Swedish company that allows users to search for songs and play them on demand
- 20m tracks; 20K more are added per day
- Runs on a 700-node Hadoop cluster
- Trying to
  - Recommend music to users
  - Provide intelligent search functionality
- Recommendations are precomputed overnight
- Collaborative-filtering type
- Uses signals like the time the user started streaming the track, when she stopped, and IP address location; no rating info (can use the number of streams)
- Builds vectors (fingerprints) of users and tracks
- Uses cosine similarity to find the top-scoring recommendations
- Algos: matrix factorization, probabilistic latent semantic filtering, k-nearest neighbors to narrow down the potential candidates for recommendations
- Problems: new users and new tracks
Check it out
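The "vectors plus cosine" step above can be sketched in plain Python. The two-dimensional vectors, the track names and the `recommend` helper are made up for illustration; real fingerprints would come out of the factorization models the slide mentions and have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(user_vec, track_vecs, already_streamed=(), top_n=2):
    """Rank unseen tracks by cosine similarity to the user's vector."""
    scored = [
        (cosine(user_vec, vec), track)
        for track, vec in track_vecs.items()
        if track not in already_streamed
    ]
    scored.sort(reverse=True)
    return [track for _, track in scored[:top_n]]


tracks = {"track_a": [1.0, 0.0], "track_b": [0.7, 0.7], "track_c": [0.0, 1.0]}
user = [0.9, 0.1]
print(recommend(user, tracks))
print(recommend(user, tracks, already_streamed={"track_a"}))
```

A brand-new user or track has no vector yet, which is the cold-start problem the slide lists.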

28 Foursquare
- Mobile app that lets users explore the city and connect with friends
- Utilizes location data
- Based on people checking in at restaurants, events, etc.
- 30m people, 50m places, 3.5b check-ins, 5m check-ins per day
- Uses big data for
  - Place recommendation: how to influence users to go to some place
  - Place matching (where the user is checking in from)
- Algos: ensemble of simple models, naïve Bayes, linear models, random forests, Gaussian mixture combined with personal history and friends' history
Check it out

29 Foursquare
- Spatial models: they compile Gaussian mixture models, e.g. what's the probability of being at this place given the info received from the phone
- Sentiment detection based on users' reviews (naïve Bayes)
- Collaborative filtering, Amazon style: people who like this also like that
- Real-time place recommendations based on
  - Location
  - Time of day
  - Personal check-in history
  - Friends' preferences
  - Venue similarities
  - Aggregate historical data
  - Familiarity
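The spatial-model question above ("what's the probability of being at this place given the info received from the phone") can be sketched with Bayes' rule. This is a simplification, not Foursquare's model: each venue gets a single isotropic Gaussian instead of a mixture, coordinates are metres on a local grid, and the venues, spreads and priors are invented.

```python
import math


def gaussian_2d(point, center, sigma):
    """Density of an isotropic 2-D Gaussian at `point`."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma)) / (
        2 * math.pi * sigma * sigma
    )


def venue_posterior(point, venues):
    """P(venue | phone location) for venues given as name -> (center, sigma, prior)."""
    scores = {
        name: prior * gaussian_2d(point, center, sigma)
        for name, (center, sigma, prior) in venues.items()
    }
    total = sum(scores.values())
    if total == 0:  # the phone is far from every venue: fall back to uniform
        return {name: 1 / len(venues) for name in venues}
    return {name: score / total for name, score in scores.items()}


venues = {
    "cafe": ((0.0, 0.0), 20.0, 0.5),
    "gym": ((100.0, 0.0), 20.0, 0.5),
}
posterior = venue_posterior((10.0, 0.0), venues)
print(posterior)
```

The prior is where check-in history could enter: a venue the user or their friends visit often would start with a larger weight.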

30 Knewton
- Adaptive learning platform
- Real-time recommendations tailored for a student
- Trying to determine what the student should work on next and how to learn it (depending on the learning style: visual, geometric approach, etc.)
- Their big clients: Arizona State University and University of Alabama
- Models engagement, boredom, frustration, proficiency, and the extent to which a student knows or doesn't know a particular topic
- Algos: Item Response Theory model (estimates the probability that a student is able to do something based on an answer to a particular question)
- Signals: click stream history (Did they check the review page? Did they check the hint? How long did it take them to answer? Did they change their mind when answering a question?)
- Runs on Amazon Web Services
Check it out
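As a minimal illustration of Item Response Theory (not Knewton's model), the common logistic form gives the probability of a correct answer from the gap between student ability and question difficulty. The function names, the grid-search estimator and the sample responses are assumptions.

```python
import math


def p_correct(ability, difficulty, discrimination=1.0):
    """Logistic IRT curve: chance that the student answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def estimate_ability(responses, grid=None):
    """Maximum-likelihood ability over a coarse grid.

    `responses` is a list of (item_difficulty, answered_correctly) pairs.
    """
    grid = grid or [x / 10 for x in range(-40, 41)]

    def log_likelihood(theta):
        total = 0.0
        for difficulty, correct in responses:
            p = p_correct(theta, difficulty)
            total += math.log(p if correct else 1.0 - p)
        return total

    return max(grid, key=log_likelihood)


responses = [(-1.0, True), (0.0, True), (1.0, False), (2.0, False)]
ability = estimate_ability(responses)
print(ability, p_correct(ability, difficulty=1.0))
```

Once an ability estimate exists, the same curve predicts how likely the student is to get an unseen question right, which is one way to choose what to work on next.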

31 IntentMedia
- End-to-end solution for e-commerce sites seeking to monetize their website traffic through advertising while still protecting conversions
- Online travel agencies convert perhaps 3% to 5% of site visitors
- IntentMedia can help sites monetize the rest of the visitors
- Combines consumer-intent data with Intent Media predictive analysis to serve up competitors' ads to consumers who are deemed unlikely to convert on the initial publisher's site
- Runs on Amazon Web Services; uses Pig, Cascalog, Hadoop
- Largest job: 25m records, 440 feature signals
Check it out

32 Thanks! Q&A?
