Real World Hadoop Use Cases




Real World Hadoop Use Cases
JFokus 2013, Stockholm
Eva Andreasson, Cloudera Inc.
Lars Sjödin, King.com
2012 Cloudera, Inc.

Agenda
- Recap of Big Data and Hadoop
- Analyzing Twitter feeds with Hadoop
- Real world Hadoop use case, featuring King.com
- Q&A

Big Data?
- Big Data
  - Increased volumes of data
  - Increased speed of incoming data
  - Increased variety of data types
- Challenges
  - Stress on traditional systems
  - Process more data within same time window
  - ETL / cleansing of exponential ingest amounts and new data types
  - Inflexible models for when questions change
  - Siloed data / organizations preventing most value

Hadoop Distributed File System (HDFS)

MapReduce: A scalable data processing framework

REAL WORLD EXAMPLE #1: ANALYZING TWITTER DATA WITH HADOOP

Analyzing Twitter
- Social media popular with marketing teams
- Twitter is an effective tool for promotion
- But how do we find out who is most influential:
  - Who is influential and has the most followers?
  - Which Twitter user gets the most retweets?
  - Who is influential in our industry?

Techniques
- SQL
  - Filter on industry
  - Aggregate tweets by original poster and count retweets
  - Sort
- Complex data
  - Deeply nested
  - Variable schema
- Size of data set
- Hadoop!

Flume
- Streaming data flow (like Twitter)
- Sources
- Push or pull
- Sinks
- Event based

Pulling data from Twitter
- Custom source, using twitter4j
- Source will process data as discrete events
- Filter on key words
- Sink writes to files in HDFS

Loading data into HDFS
- HDFS Sink comes stock with Flume
- Easily separate files by creation time:
  hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
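As a rough illustration of the time-based bucketing, the sketch below resolves an event timestamp to an hourly directory under the tweets path. `TweetPaths` and `hourlyDir` are hypothetical names, not Flume APIs, and the four-digit year, 24-hour layout and UTC time zone are assumptions made for the example; in a real agent the HDFS sink performs this substitution itself from the event's timestamp header.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Flume: mimics how a time-bucketed
// sink path resolves to one directory per hour.
final class TweetPaths {
    // Assumed layout: <base>/<4-digit year>/<month>/<day>/<hour 00-23>/
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    private TweetPaths() {}

    // Maps a timestamp (milliseconds since the epoch) to its hourly directory.
    // UTC is an assumption of this sketch.
    static String hourlyDir(String base, long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return base + "/" + HOURLY.format(t) + "/";
    }

    public static void main(String[] args) {
        String base = "hdfs://hadoop1:8020/user/flume/tweets";
        System.out.println(hourlyDir(base, System.currentTimeMillis()));
    }
}
```

Every tweet created within the same hour lands in the same directory, which is what later makes hourly Hive partitions straightforward.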

Outline of Flume Source for Tweets

    public class TwitterSource extends AbstractSource
        implements EventDrivenSource, Configurable {
      ...
      // The initialization method for the Source. The context contains all
      // the Flume configuration info
      @Override
      public void configure(Context context) {
        ...
      }
      ...
      // Start processing events. Uses the Twitter Streaming API to sample
      // Twitter, and process tweets.
      @Override
      public void start() {
        ...
      }
      ...
      // Stops Source's event processing and shuts down the Twitter stream.
      @Override
      public void stop() {
        ...
      }
    }

Twitter API
Callback mechanism for catching new tweets

    /** The actual Twitter stream. It's set up to collect raw JSON data */
    private final TwitterStream twitterStream = new TwitterStreamFactory(
        new ConfigurationBuilder().setJSONStoreEnabled(true).build()).getInstance();
    ...
    // The StatusListener is a twitter4j API that can be added to a stream,
    // and will call a method every time a message is sent to the stream.
    StatusListener listener = new StatusListener() {
      // The onStatus method is executed every time a new tweet comes in.
      public void onStatus(Status status) {
        ...
      }
    };
    ...
    // Set up the stream's listener (defined above), and set any necessary
    // security information.
    twitterStream.addListener(listener);
    twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
    AccessToken token = new AccessToken(accessToken, accessTokenSecret);
    twitterStream.setOAuthAccessToken(token);

JSON data
JSON data is processed as an event and written to HDFS

    public void onStatus(Status status) {
      // The EventBuilder is used to build an event using the headers and
      // the raw JSON of a tweet
      headers.put("timestamp", String.valueOf(status.getCreatedAt().getTime()));
      Event event = EventBuilder.withBody(
          DataObjectFactory.getRawJSON(status).getBytes(), headers);
      channel.processEvent(event);
    }

What is Hive?
- HiveQL: SQL-like interface
- Hive interpreter converts HiveQL to MapReduce code
- Returns results to the client

Hive details
- Schema on read
  - Scalar types (int, float, double, boolean, string)
  - Complex types (struct, map, array)
- Metastore contains table definitions
  - Allows queries to be data agnostic
  - Stored in a relational database
  - Similar to catalog tables in other DBs

Hive Serializers and Deserializers (SerDe)
- Instructs Hive on how to interpret data
- JSONSerDe
- Hive strengths:
  - Flexible in the data model
  - Extendable format support

Analyzing Twitter data with Hadoop: PUTTING IT ALL TOGETHER

Architecture (components, from ingest to query)
- Custom Flume Source
- Flume Sink to HDFS
- HDFS
- JSON SerDe parses data
- Hive
- Query through your favorite SQL tool

Now We Can Start Asking Bigger Questions

    SELECT
      t.retweeted_screen_name,
      sum(retweets) AS total_retweets,
      count(*) AS tweet_count
    FROM (SELECT
            retweeted_status.user.screen_name AS retweeted_screen_name,
            retweeted_status.text,
            max(retweet_count) AS retweets
          FROM tweets
          GROUP BY retweeted_status.user.screen_name,
                   retweeted_status.text) t
    GROUP BY t.retweeted_screen_name
    ORDER BY total_retweets DESC
    LIMIT 10;
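To make the two aggregation levels of the query easier to follow, here is a small in-memory sketch of the same logic in Java. It is only an illustration on a handful of objects, not how Hive executes the query; `RetweetRanking`, `Retweet` and `Row` are hypothetical names introduced for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of the query's two GROUP BY levels.
final class RetweetRanking {
    // One captured retweet: original poster, original text, and the
    // retweet_count reported at capture time.
    static final class Retweet {
        final String screenName;
        final String text;
        final int retweetCount;
        Retweet(String screenName, String text, int retweetCount) {
            this.screenName = screenName;
            this.text = text;
            this.retweetCount = retweetCount;
        }
    }

    // One result row: poster, summed retweets, number of distinct tweets.
    static final class Row {
        final String screenName;
        final long totalRetweets;
        final long tweetCount;
        Row(String screenName, long totalRetweets, long tweetCount) {
            this.screenName = screenName;
            this.totalRetweets = totalRetweets;
            this.tweetCount = tweetCount;
        }
    }

    static List<Row> topRetweeted(List<Retweet> observed, int limit) {
        // Inner query: max(retweet_count) per (screen_name, text), so a tweet
        // captured several times is only counted at its highest value.
        Map<String, Map<String, Integer>> maxPerTweet = new HashMap<>();
        for (Retweet r : observed) {
            maxPerTweet.computeIfAbsent(r.screenName, k -> new HashMap<>())
                       .merge(r.text, r.retweetCount, Math::max);
        }
        // Outer query: sum(retweets) and count(*) per screen_name.
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> e : maxPerTweet.entrySet()) {
            long total = 0;
            for (int c : e.getValue().values()) {
                total += c;
            }
            rows.add(new Row(e.getKey(), total, e.getValue().size()));
        }
        // ORDER BY total_retweets DESC LIMIT n
        rows.sort((a, b) -> Long.compare(b.totalRetweets, a.totalRetweets));
        return new ArrayList<>(rows.subList(0, Math.min(limit, rows.size())));
    }
}
```

The inner `max` matters because the same original tweet shows up once per captured retweet, each time with a growing `retweet_count`; summing those raw rows would overcount.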

Analyzing Twitter data with Hadoop: TEASER: FASTER HIVE? GO IMPALA!

Try it out yourself?
- Cloudera provides demo VMs: https://ccp.cloudera.com/display/support/cloudera+manager+free+edition+demo+vm
- More info and examples: http://blog.cloudera.com/

Beyond Big and Data
Prelude to a Philosophy of the BI Future
Lars Sjödin


Analyzing Twitter data with Hadoop: EXTRA SLIDES

NOTE: Hive is not a database

                     RDBMS                    Hive
Language             Generally >= SQL-92      Subset of SQL-92 plus Hive specific extensions
Update capabilities  INSERT, UPDATE, DELETE   INSERT OVERWRITE; no UPDATE, DELETE
Transactions         Yes                      No
Latency              Sub-second               Minutes
Indexes              Yes                      Yes
Data size            Terabytes                Petabytes

My personal preference to reduce complexity
- Cloudera Manager: https://ccp.cloudera.com/display/support/downloads
- Free up to 50 nodes

Analyzing Twitter data with Hadoop: JSON INTERLUDE

What is JSON?
- Complex, semi-structured data
- Based on JavaScript's data syntax
- Rich, nested data types:
  - number
  - string
  - array
  - object
  - true, false
  - null

What is JSON?

    {
      "retweeted_status": {
        "contributors": null,
        "text": "#Crowdsourcing drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
        "retweeted": false,
        "entities": {
          "hashtags": [
            { "text": "Crowdsourcing", "indices": [0, 14] },
            { "text": "bigdata", "indices": [129, 137] }
          ],
          "user_mentions": []
        }
      }
    }

Analyzing Twitter data with Hadoop: OOZIE: AUTOMATION

Oozie: everything in its right place

Oozie for partition management
- Once an hour, add a partition
- Takes advantage of advanced Hive functionality
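The slides do not show the workflow itself, so the following is only a sketch of the kind of Hive statement an hourly job could issue for each new directory of tweets. The partition column `datehour`, the table name `tweets`, the base directory and the UTC time zone are all assumptions for illustration, and `HourlyPartition` is a hypothetical helper rather than an Oozie or Hive API.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: builds the Hive DDL that registers one hour of data.
final class HourlyPartition {
    // Assumed partition key format, e.g. 2013013009 for 30 Jan 2013, 09:00.
    private static final DateTimeFormatter KEY =
            DateTimeFormatter.ofPattern("yyyyMMddHH").withZone(ZoneOffset.UTC);
    // Assumed directory layout: one directory per hour under the base path.
    private static final DateTimeFormatter DIR =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    private HourlyPartition() {}

    static String addPartitionSql(String table, String baseDir, Instant hour) {
        // IF NOT EXISTS keeps the statement safe to re-run for the same hour.
        return "ALTER TABLE " + table
                + " ADD IF NOT EXISTS PARTITION (datehour = " + KEY.format(hour) + ")"
                + " LOCATION '" + baseDir + "/" + DIR.format(hour) + "'";
    }

    public static void main(String[] args) {
        System.out.println(addPartitionSql("tweets", "/user/flume/tweets", Instant.now()));
    }
}
```

Pointing each partition at the directory written during that hour lets queries that filter on the partition column skip every other hour's files.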