Introduction to Big Data Science, 14th Period: Retrieving, Storing, and Querying Big Data

Contents
- Retrieving Data from SNS
- Introduction to Facebook APIs and Data Format
- K-V Data Scheme on Hadoop
- Storing and Querying Data on Hive
- Using Map-Reduce Programming for SA

Distributed Objects
- Objects that can communicate with objects on heterogeneous run-time environments
- Standard protocols (e.g., JRMP): robust, reliable, transparent
- Distributed-objects technology: multi-platform, transparent access to distributed objects, language neutral (RMI, CORBA, DCOM)

Java Remote Method Invocation (RMI)
- Lets a program use objects on remote run-time environments as if they were objects on the local run-time environment
- Abstracts the low-level network code of a distributed system so that developers can focus on their application development
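To make the abstraction concrete, here is a minimal, self-contained sketch (the Greeter interface and all names are illustrative, not from the lecture): the client invokes greet() through a stub exactly as if the object were local, while RMI supplies the network plumbing.

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Remote interface: callers on other JVMs see only this contract.
interface Greeter extends Remote {
    String greet(String name) throws RemoteException;
}

// Server-side implementation; RMI generates the stub that does the networking.
class GreeterImpl implements Greeter {
    public String greet(String name) { return "Hello, " + name; }
}

public class RmiSketch {
    public static void main(String[] args) throws Exception {
        // Export the object and register it under a well-known name.
        Greeter stub = (Greeter) UnicastRemoteObject.exportObject(new GreeterImpl(), 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("greeter", stub);

        // A client (possibly on another machine) looks it up and calls it
        // exactly as it would call a local object.
        Greeter remote = (Greeter) LocateRegistry.getRegistry("localhost", 1099).lookup("greeter");
        System.out.println(remote.greet("Big Data Science"));
    }
}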

CORBA Contributions
CORBA addresses two challenges of developing distributed systems:
- Making distributed application development no more difficult than developing centralized programs. This is easier said than done due to partial failures, the impact of latency, load balancing, and event ordering.
- Providing an infrastructure to integrate application components into a distributed system, i.e., CORBA is an "enabling technology".

APIs on the Web: Web Service Standard
- Recommended by the W3C; robust and fast, but not easy to use
- Simple Object Access Protocol (SOAP): simple XML messages, remote procedure call
- Web Service Description Language (WSDL): specification of a web service's functions
- Universal Description, Discovery, and Integration (UDDI): create, store, and search service information

APIs on the Web: RESTful Web API
- Not standardized by any authority, but easy to use
- Representational State Transfer (REST) is an architectural style consisting of a coordinated set of constraints applied to components, connectors, and data elements within a distributed hypermedia system. REST ignores the details of component implementation and protocol syntax in order to focus on the roles of components, the constraints upon their interaction with other components, and their interpretation of significant data elements.
- REST has been applied to describe the desired web architecture, to identify existing problems, to compare alternative solutions, and to ensure that protocol extensions would not violate the core constraints that make the Web successful. Fielding used REST to design HTTP 1.1 and Uniform Resource Identifiers (URIs).
- The REST architectural style is also applied to the development of web services as an alternative to other distributed-computing specifications such as SOAP.
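In practice a RESTful call is nothing more than an HTTP request against a resource URI. A minimal Java sketch under that view, using the Graph API form of URL covered in the Facebook slides that follow (the access token is a placeholder, and the response handling is deliberately bare):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestGetSketch {
    public static void main(String[] args) throws Exception {
        // One resource, one URI, one HTTP verb: GET the current user's profile.
        // YOUR_ACCESS_TOKEN stands in for a real OAuth token.
        URL url = new URL("https://graph.facebook.com/me?access_token=YOUR_ACCESS_TOKEN");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");

        // The resource's representation comes back as a JSON document.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}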

Retrieving Data from SNS
- Social network services (SNS) provide useful APIs for accessing their data. Usually they provide them in the form of Web APIs, web programming libraries, or smartphone SDKs.
- It is almost impossible for us to retrieve all of the data, but we can save what we need for a special purpose to long-term big data storage.

Web APIs for the Web and Several SNS
- Facebook APIs: Graph API, Open Graph, Dialogs, Chat, Ads API, FQL, Localization and translation, Atlas API, Public Feed API, Keyword Insights API
- Twitter API
- Google API

Twitter API: REST API v1.1 Resources
- Timelines: collections of Tweets, ordered with the most recent first.
- Tweets: the atomic building blocks of Twitter; 140-character status updates with additional associated metadata. People tweet for a variety of reasons about a multitude of topics.
- Search: find relevant Tweets based on queries performed by your users.
- Streaming
- Direct Messages: short, non-public messages sent between two users. Access to Direct Messages is governed by the Application Permission Model.

Twitter API
- Friends & Followers: users follow their interests on Twitter through both one-way and mutual following relationships.
- Users: users are at the center of everything on Twitter: they follow, they favorite, and they tweet & retweet.
- Suggested Users: categorical organization of users that others may be interested in following.
- Favorites: users favorite tweets to give recognition to awesome tweets, to curate the best of Twitter, to save tweets for reading later, and for a variety of other reasons. Likewise, developers make use of "favs" in many different ways.

Twitter API
- Lists: collections of tweets, culled from a curated list of Twitter users. List timeline methods include tweets by all members of a list.
- Saved Searches: allow users to save references to search criteria for reuse later.
- Places & Geo: users tweet from all over the world. These methods allow you to attach location data to tweets and to discover tweets & locations.
- Trends: with so many tweets from so many users, themes are bound to arise from the zeitgeist. The Trends methods allow you to explore what's trending on Twitter.
- Spam Reporting: these methods are used to report user accounts as spam accounts.

Facebook APIs
- Graph API: a simple HTTP-based API that gives access to the Facebook social graph, uniformly representing objects in the graph and the connections between them. Most other APIs at Facebook are based on the Graph API.
- Open Graph: allows apps to tell stories on Facebook through a structured, strongly typed API.
- Dialogs: Facebook offers a number of dialogs for Facebook Login, posting to a person's timeline, or sending requests.
- Chat: you can integrate Facebook Chat into your web-based, desktop, or mobile instant-messaging products. Your instant-messaging client connects to Facebook Chat via the Jabber XMPP service.

Facebook APIs
- Ads API: allows you to build your own app as a customized alternative to the Facebook Ads Manager and Power Editor tools.
- FQL: Facebook Query Language enables you to use a SQL-style interface to query the data exposed by the Graph API. It provides some advanced features not available in the Graph API, such as using the results of one query in another.
- Localization and translation: Facebook supports localization of apps. Read about the tools provided.
- Atlas API: provides programmatic access to the Atlas web services.

Facebook APIs
- Public Feed API: lets you read the stream of public comments as they are posted to Facebook.
- Keyword Insights API: exposes an analysis layer on top of all Facebook posts that enables you to query aggregate, anonymous insights about people mentioning a certain term.

Facebook Query APIs: FQL
- Facebook tables (slide figure: the set of Facebook tables queryable through FQL)
- Fields of the comment table (slide figure: the field list of the comment table)

Facebook APIs: Running Example
- Example: runs the query "SELECT uid2 FROM friend WHERE uid1=me()"
  https://developers.facebook.com/tools/explorer?method=get&path=fql%3Fq%3DSELECT+uid2+FROM+friend+WHERE+uid1%3Dme%28%29
- Read: you can issue an HTTP GET request to /fql?q=query, where query can be a single FQL query or a JSON-encoded dictionary of queries.
- Query: queries are of the form SELECT [fields] FROM [table] WHERE [conditions]. Unlike SQL, the FQL FROM clause can contain only a single table. You can use the IN keyword in SELECT or WHERE clauses to do subqueries, but the subqueries cannot reference variables in the outer query's scope. Your query must also be indexable, meaning that it queries properties that are marked as indexable in the documentation.

FQL Example

<?php
$app_id     = 'YOUR_APP_ID';
$app_secret = 'YOUR_APP_SECRET';
$my_url     = 'POST_AUTH_URL';
$code       = $_REQUEST["code"];

// auth user
if (empty($code)) {
    $dialog_url = 'https://www.facebook.com/dialog/oauth?client_id=' . $app_id
                . '&redirect_uri=' . urlencode($my_url);
    echo("<script>top.location.href='" . $dialog_url . "'</script>");
}

// get user access_token
$token_url = 'https://graph.facebook.com/oauth/access_token?client_id=' . $app_id
           . '&redirect_uri=' . urlencode($my_url)
           . '&client_secret=' . $app_secret
           . '&code=' . $code;

// response is of the format "access_token=AAAC..."
$access_token = substr(file_get_contents($token_url), 13);

FQL Example (continued)

// run fql query
$fql_query_url = 'https://graph.facebook.com/'
    . 'fql?q=SELECT+uid2+FROM+friend+WHERE+uid1=me()'
    . '&access_token=' . $access_token;
$fql_query_result = file_get_contents($fql_query_url);
$fql_query_obj = json_decode($fql_query_result, true);

// display results of fql query
echo '<pre>';
print_r("query results:");
print_r($fql_query_obj);
echo '</pre>';

// run fql multiquery
$fql_multiquery_url = 'https://graph.facebook.com/'
    . 'fql?q={"all+friends":"SELECT+uid2+FROM+friend+WHERE+uid1=me()",'
    . '"my+name":"SELECT+name+FROM+user+WHERE+uid=me()"}'
    . '&access_token=' . $access_token;
$fql_multiquery_result = file_get_contents($fql_multiquery_url);
$fql_multiquery_obj = json_decode($fql_multiquery_result, true);

// display results of fql multiquery
echo '<pre>';
print_r("multi query results:");
print_r($fql_multiquery_obj);
echo '</pre>';
?>

Map-Reduce for Multiple Outputs
- Parallel execution of a Map-Reduce program: to implement several control flows in the Map operation, we could use GenericOptionsParser, but that kind of approach can decrease performance severely on big data.
- MultipleOutputs instead lets a single Map-Reduce job produce several output data sets in parallel.
- org.apache.hadoop.mapreduce.lib.output.MultipleOutputs
  - Provides the ability to create multiple output data sets: it creates multiple OutputCollectors and sets the output path, output format, and key and value types for each OutputCollector.
  - The data it creates is in addition to the existing Map-Reduce output: when a Map-Reduce job finishes, an output file part-r-nnnnn is created in the Reduce stage; if the programmer also writes to a named output myfile via MultipleOutputs, part-r-nnnnn and myfile-r-nnnnn are created at the same time.

Mapper Implementation for MultipleOutputs

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DelayCountMapperWithMultipleOutputs
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // map output value
    private final static IntWritable outputValue = new IntWritable(1);
    // map output key
    private Text outputKey = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (key.get() > 0) { // skip the CSV header line at byte offset 0
            String[] columns = value.toString().split(",");
            if (columns != null && columns.length > 0) {
                try {
                    // departure delay data output (column 15 = DepDelay)
                    if (!columns[15].equals("NA")) {
                        int depDelayTime = Integer.parseInt(columns[15]);
                        if (depDelayTime > 0) {
                            // output key: "D,<year>,<month>"
                            outputKey.set("D," + columns[0] + "," + columns[1]);
                            // output data creation
                            context.write(outputKey, outputValue);
                        } else if (depDelayTime == 0) {
                            // DelayCounters is a user-defined enum of counter names
                            context.getCounter(DelayCounters.scheduled_departure).increment(1);
                        } else if (depDelayTime < 0) {
                            context.getCounter(DelayCounters.early_departure).increment(1);
                        }

Mapper Implementation for MultipleOutputs (continued)

                    } else {
                        context.getCounter(DelayCounters.not_available_departure).increment(1);
                    }
                    // arrival delay data output (column 14 = ArrDelay)
                    if (!columns[14].equals("NA")) {
                        int arrDelayTime = Integer.parseInt(columns[14]);
                        if (arrDelayTime > 0) {
                            // output key: "A,<year>,<month>"
                            outputKey.set("A," + columns[0] + "," + columns[1]);
                            // output data creation
                            context.write(outputKey, outputValue);
                        } else if (arrDelayTime == 0) {
                            context.getCounter(DelayCounters.scheduled_arrival).increment(1);
                        } else if (arrDelayTime < 0) {
                            context.getCounter(DelayCounters.early_arrival).increment(1);
                        }
                    } else {
                        context.getCounter(DelayCounters.not_available_arrival).increment(1);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}

Reducer Implementation for MultipleOutputs

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class DelayCountReducerWithMultipleOutputs
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs<Text, IntWritable> mos;
    // reduce output key
    private Text outputKey = new Text();
    // reduce output value
    private IntWritable result = new IntWritable();

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // split the map output key ("D,<year>,<month>" or "A,<year>,<month>") by comma
        String[] columns = key.toString().split(",");
        // output key setting: "<year>,<month>"
        outputKey.set(columns[1] + "," + columns[2]);
        // departure delay
        if (columns[0].equals("D")) {
            // delay count sum
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            // output value setting
            result.set(sum);
            // write to the named output "departure"
            mos.write("departure", outputKey, result);

Reducer Implementation for MultipleOutputs (continued)

        // arrival delay
        } else {
            // delay count sum
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            // output value setting
            result.set(sum);
            // write to the named output "arrival"
            mos.write("arrival", outputKey, result);
        }
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
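The slides show only the mapper and reducer. For mos.write("departure", ...) and mos.write("arrival", ...) to work, the named outputs must also be registered in the job driver, which the slides omit. A minimal driver sketch under that assumption (the class name and input/output paths are illustrative, not from the lecture):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DelayCountWithMultipleOutputs {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "delay-count");
        job.setJarByClass(DelayCountWithMultipleOutputs.class);
        job.setMapperClass(DelayCountMapperWithMultipleOutputs.class);
        job.setReducerClass(DelayCountReducerWithMultipleOutputs.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Register the named outputs the reducer writes with mos.write(...):
        // these produce departure-r-nnnnn and arrival-r-nnnnn files
        // alongside the default part-r-nnnnn.
        MultipleOutputs.addNamedOutput(job, "departure",
                TextOutputFormat.class, Text.class, IntWritable.class);
        MultipleOutputs.addNamedOutput(job, "arrival",
                TextOutputFormat.class, Text.class, IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}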

Hive Programming
- Hive provides a means of running MapReduce jobs through a SQL-like scripting language, called HiveQL, which can be applied to the summarization, querying, and analysis of large volumes of data.
- Important differences from SQL: table-generating functions, lateral views
- Useful URLs:
  - Hive: https://cwiki.apache.org/confluence/display/Hive/Home
  - Language reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual

Hive Programming: Workflow of Hive
- Create a table
- Load data into HDFS/Hive
- Query the data: use HiveQL to query the data
  - Table-generating functions
  - User-defined operations via external programs (TRANSFORM)
  - Lateral views
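The same create/load/query workflow can also be driven programmatically rather than from the hive> shell. A minimal sketch using the HiveServer2 JDBC driver, assuming a server listening at localhost:10000 and the sample kv1.txt file from the slides below (the URL, credentials, and table are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWorkflowSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; assumes a server running on localhost:10000.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // 1. Create a table (mirrors the DDL slide below).
            stmt.execute("CREATE TABLE IF NOT EXISTS pokes (foo INT, bar STRING)");

            // 2. Load data into Hive from a local flat file.
            stmt.execute("LOAD DATA LOCAL INPATH './examples/files/kv1.txt' "
                       + "OVERWRITE INTO TABLE pokes");

            // 3. Query the data with HiveQL; Hive compiles this to MapReduce.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT bar, count(*) FROM pokes GROUP BY bar")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}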

HiveQL: DDL Operations
Creating Hive tables:
hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
Browsing through tables:
hive> SHOW TABLES;
hive> SHOW TABLES '.*s';
hive> DESCRIBE invites;
Altering and dropping tables:
hive> ALTER TABLE events RENAME TO 3koobecaf;
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');
hive> DROP TABLE pokes;

HiveQL: DML Operations
Loading data from flat files into Hive:
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

HiveQL: SQL Operations
SELECTs and FILTERs:
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

HiveQL: GROUP BY, JOIN, STREAMING
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

Table example for Apache weblog data:
CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE;

Table-Generating Functions
- Functions that generate multiple rows from one: a table-generating function allows a single row to expand to multiple rows.
- explode is one such example; it takes an array and generates a row for each item in the array (split is a function that splits a string into an array):
  SELECT explode(split(line, ' ')) AS word FROM a_file;
- TRANSFORM is a table-generating function that applies an external program (just like streaming):
  SELECT TRANSFORM(column, ...) USING command AS column-alias, ...;
- The explode(split(...)) equivalent via TRANSFORM:
  SELECT TRANSFORM(line) USING './ws.py' AS word FROM a_file;

  ws.py:
  import sys
  for line in sys.stdin:
      for w in line.split():
          print(w)
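Lateral views were listed earlier as a HiveQL difference from SQL but are not demonstrated above. A LATERAL VIEW pairs a table-generating function with the source table so that each generated row stays associated with the row it came from. A small sketch in the same hive> style, reusing the a_file(line STRING) table assumed above:

hive> SELECT word, count(*) FROM a_file LATERAL VIEW explode(split(line, ' ')) t AS word GROUP BY word;

This yields a word count directly in HiveQL: explode() expands each line into one row per word, and the lateral view lets the outer query select and group over the generated word column.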