Oracle Big Data Spatial & Graph Social Network Analysis - Case Study



Similar documents
Deep Quick-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors Mark Rittman, CTO, Rittman Mead Oracle Openworld 2014, San Francisco

Safe Harbor Statement

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Data Discovery and Systems Diagnostics with the ELK stack. Rittman Mead - BI Forum 2015, Brighton. Robin Moffatt, Principal Consultant Rittman Mead

OBIEE 11g Data Modeling Best Practices

HDP Hadoop From concept to deployment.

Comprehensive Analytics on the Hortonworks Data Platform

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Evolution of Information Management Architecture and Development

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

How Companies are! Using Spark

Oracle Big Data Essentials

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

Oracle Big Data Discovery Unlock Potential in Big Data Reservoir

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

<Insert Picture Here> Big Data

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Oracle Big Data SQL Technical Update

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Hadoop Ecosystem B Y R A H I M A.

Big Data Analytics Nokia

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

The Future of Data Management

How To Use A Data Center With A Data Farm On A Microsoft Server On A Linux Server On An Ipad Or Ipad (Ortero) On A Cheap Computer (Orropera) On An Uniden (Orran)

Workshop on Hadoop with Big Data

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

What s New with Oracle BI, Analytics and DW

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

INTRODUCTION TO CASSANDRA

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Cost-Effective Business Intelligence with Red Hat and Open Source

Oracle Big Data Fundamentals Ed 1 NEW

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Native Connectivity to Big Data Sources in MSTR 10

Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Regression & Load Testing BI EE 11g

Tap into Hadoop and Other No SQL Sources

GoldenGate and ODI - A Perfect Match for Real-Time Data Warehousing

OBIEE Deployment & Change Management

Introduction to Big Data Training

Welkom! Copyright 2014 Oracle and/or its affiliates. All rights reserved.

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Dominik Wagenknecht Accenture

Getting Started Practical Input For Your Roadmap

The Future of Data Management with Hadoop and the Enterprise Data Hub

MySQL and Hadoop Big Data Integration

Oracle BIEE and SOA Integration : Step by Step. Mark Rittman, Director, Rittman Mead Consulting

Oracle Big Data Handbook

Ganzheitliches Datenmanagement

Big Data Course Highlights

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Upcoming Announcements

Dell In-Memory Appliance for Cloudera Enterprise

A Scalable Data Transformation Framework using the Hadoop Ecosystem

Analytics on Spark &

#TalendSandbox for Big Data

Architecting for the Internet of Things & Big Data

Testing Big data is one of the biggest

Information Builders Mission & Value Proposition

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

An Oracle White Paper June Oracle: Big Data for the Enterprise

An Oracle White Paper October Oracle: Big Data for the Enterprise

Hadoop and Map-Reduce. Swati Gore

Cloudera Certified Developer for Apache Hadoop

Oracle Database 12c Plug In. Switch On. Get SMART.

Big Data Introduction

White Paper: What You Need To Know About Hadoop

TRAINING PROGRAM ON BIGDATA/HADOOP

OWB Users, Enter The New ODI World

How To Create A Data Visualization With Apache Spark And Zeppelin

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

Oracle Data Integrator for Big Data. Alex Kotopoulis Senior Principal Product Manager

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

HDP Enabling the Modern Data Architecture

Oracle Big Data Strategy Simplified Infrastrcuture

ITG Software Engineering

How To Handle Big Data With A Data Scientist

Implement Hadoop jobs to extract business value from large and varied data sets

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Qsoft Inc

BIG DATA - HADOOP PROFESSIONAL amron

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

An Integrated Analytics & Big Data Infrastructure September 21, 2012 Robert Stackowiak, Vice President Data Systems Architecture Oracle Enterprise

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer

Building Your Big Data Team

An Oracle BI and EPM Development Roadmap

How To Use Big Data For Business

Big Data Too Big To Ignore

Transcription:

Oracle Big Data Spatial & Graph Social Network Analysis - Case Study Mark Rittman, CTO, Rittman Mead OTN EMEA Tour, May 2016 info@rittmanmead.com www.rittmanmead.com @rittmanmead

About the Speaker Mark Rittman, Co-Founder of Rittman Mead Oracle ACE Director, specialising in Oracle BI&DW 14 Years Experience with Oracle Technology Regular columnist for Oracle Magazine Author of two Oracle Press Oracle BI books Oracle Business Intelligence Developers Guide Oracle Exalytics Revealed Writer for Rittman Mead Blog : http://www.rittmanmead.com/blog Email : mark.rittman@rittmanmead.com Twitter : @markrittman info@rittmanmead.com www.rittmanmead.com @rittmanmead 3

About Rittman Mead Oracle Gold Partner with offices in the UK and USA (Atlanta) 70+ staff delivering Oracle BI, DW, Big Data and Advanced Analytics projects Oracle ACE Director (Mark Rittman, CTO) + 2 Oracle ACEs Significant web presence with the Rittman Mead Blog (http://www.rittmanmead.com) Regular sers of social media (Facebook, Twitter, Slideshare etc) Regular column in Oracle Magazine and other publications Hadoop R&D lab for dogfooding solutions developed for customers info@rittmanmead.com www.rittmanmead.com @rittmanmead 4

Business Scenario Rittman Mead want to understand drivers and audience for their website What is our most popular content? Who are the most in-demand blog authors? Who are the influencers? What communities exist around our web presence? Three data sources in scope: RM Website Logs Twitter Stream Website Posts, Comments etc info@rittmanmead.com www.rittmanmead.com @rittmanmead 5

Real-Time & Batch Log & Event Ingestion : ODI12c ODI provides an excellent framework for running Hadoop ETL jobs ELT approach pushes transformations down to Hadoop - leveraging power of cluster Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation Whilst still preserving RDBMS push-down Extensible to cover Pig, Spark etc Process orchestration Data quality / error handling Metadata and model-driven New in 12.1.3.0.1 - ability to generate Pig and Spark jobs too info@rittmanmead.com www.rittmanmead.com @rittmanmead 6

Overall Project Architecture - Phase 1 Initial iteration of project focused on capturing and ingesting web + social media activity Apache Flume used for capturing website hits, page views Twitter Streaming API used to capture tweets referring to RM website or RM staff Activity landed into Hadoop (HDFS), processed and enriched and presented using Hive info@rittmanmead.com www.rittmanmead.com @rittmanmead 7

Real-Time Metrics around Site Activity - What? Provided real-time counts of page views, correlated with Twitter activity stored in Hive tables Accessed using Oracle Big Data SQL + joined to Oracle RDBMS reference data Delivered using OBIEE reports and dashboards Data Warehousing, but cheaper + real-time Answered questions such as What are our most popular site pages? Which pages attracted the most attention on Twitter, Facebook? What topics are popular? Combine with Oracle Big Data SQL for structured OBIEE dashboard analysis What pages are people visiting? Who is referring to us on Twitter? What content has the most reach? info@rittmanmead.com www.rittmanmead.com @rittmanmead 8

Oracle BDD for Data Wrangling + Data Enrichment Oracle Big Data Discovery used to go back to the raw event data add more meaning Enrich data, extract nouns + terms, add reference data from file, RDBMS etc Understand sentiment + meaning of tweets, link disparate + loosely coupled events Faceted search dashboards info@rittmanmead.com www.rittmanmead.com @rittmanmead 9

Answered the What and Why Questions Counts of page views, tweets, mentions etc helped us understand what content was popular Analysis of tweet sentiment, meaning and correlation with content answered why Combine with Oracle Big Data SQL for structured OBIEE dashboard analysis Combine with site content, semantics, text enrichment Catalog and explore using Oracle Big Data Discovery What pages are people visiting? Who is referring to us on Twitter? What content has the most reach? Why is some content more popular? Does sentiment affect viewership? What content is popular, where? info@rittmanmead.com www.rittmanmead.com @rittmanmead 10

But Who Are The Influencers In Our Community? Previous counts assumed that all tweet references equally important But some Twitter users are far more influential than others Sit at the centre of a community, have 1000 s of followers A reference by them has massive impact on page views Positive or negative comments from them drive perception Can we identify them? Potentially reach out with analyst program Study what website posts go viral Understand out audience, and the conversation, better Find out people that are central in the given network e.g. influencer marketing Influencer Identification Communication Stream (e.g. tweets) info@rittmanmead.com www.rittmanmead.com @rittmanmead 11

What Communities and Networks Are Our Audience? Rittman Mead website features many types of content Blogs on BI, data integration, big data, data warehousing Op-Eds ( OBIEE12c - Three Months In, What s the Verdict? ) Articles on a theme, e.g. performance tuning Details of new courses, new promotions Different communities likely to form around these content types Different influencers and patterns of recommendation, discovery Can we identify some of the communities, segment our audience? Identify group of people that are close to each other e.g. target group marketing Community Detection info@rittmanmead.com www.rittmanmead.com @rittmanmead 12

Tabular (SQL) Query Tools Aimed at Counts + Aggs info@rittmanmead.com www.rittmanmead.com @rittmanmead 13

Graph Example : RM Blog Post Referenced on Twitter Follows Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gfcupom5pi Follows 0 0 0 01 23 Page Views info@rittmanmead.com www.rittmanmead.com @rittmanmead 14

Network Effect Magnified by Extent of Social Graph Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gfcupom5pi Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gfcupom5pi 0 0 05 37 Page Views info@rittmanmead.com www.rittmanmead.com @rittmanmead 15

Retweets by Influential Twitter Users Drive Visits RT: Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gfcupom5pi 0 0 3 5 Page Views Retweet Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gfcupom5pi 0 0 0 3 Page Views info@rittmanmead.com www.rittmanmead.com @rittmanmead 16

Retweets, Mentions and Replies Create Communities Retweet Mention Mention Mention Reply Reply Mention Mention Reply #bigdatasql #thatswhatshesaid info@rittmanmead.com www.rittmanmead.com @rittmanmead 17

Property Graph Terminology Node, or Vertex Vertex Properties Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gfcupom5pi Edge Type Directed Connection, or Edge Mentions Node, or Vertex info@rittmanmead.com www.rittmanmead.com @rittmanmead 18

Property Graph Terminology Directed Connection, or Edge Retweets Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gfcupom5pi Node, or Vertex Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gfcupom5pi Node, or Vertex Mentions info@rittmanmead.com www.rittmanmead.com @rittmanmead 19

Determining Influencers - Factors to Consider Different types of Twitter interaction could imply more or less influence Retweet of another user s Tweet implies that person is worth quoting or you endorse their opinion Reply to another user s tweet could be a weaker recognition of that person s opinion or view Mention of a user in a tweet is a weaker recognition that they are part of a community / debate info@rittmanmead.com www.rittmanmead.com @rittmanmead 20

Relative Importance of Edge Types Added via Weights Edge Property Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gfcupom5pi Retweet, Weight = 100 Lifting the Lid on OBIEE Internals with Linux Diagnostics Tools http://t.co/gfcupom5pi Edge Property Mentions, Weight = 30 info@rittmanmead.com www.rittmanmead.com @rittmanmead 21

Oracle Big Data Spatial & Graph Graph, spatial and raster data processing for big data Primarily documented + tested against Oracle BDA Installable on commodity cluster using CDH Data stored in Apache HBase or Oracle NoSQL DB Complements Spatial & Graph in Oracle Database Designed for trillions of nodes, edges etc Out-of-the-box spatial enrichment services Over 35 of most popular graph analysis functions Graph traversal, recommendations Finding communities and influencers, Pattern matching info@rittmanmead.com www.rittmanmead.com @rittmanmead 22

Oracle Big Data Graph and Spatial Architecture Data loaded from files or through Java API into HBase In-Memory Analytics layer runs common graph and spatial algorithms on data Visualised using R or other graphics packaged Lightning-Fast In-Memory Analytics YARN Container Standalone Server Embedded Massively Scalable Graph Store Oracle NoSQL HBase info@rittmanmead.com www.rittmanmead.com @rittmanmead

Preparing Vertices and Edges for Ingestion ODI12c used to prepare two files in Oracle Flat File Format Extracted vertices and edges from existing data in Hive Wrote vertices (Twitter users) to.opv file, edges (RTs, replies etc) to.ope file For exercise, only considered 2-3 days of tweets Did not include follows (user A followed user B) as not reported by Twitter Streaming API Could approximate larger follower networks through multiplying weight of edge by follower scale -Useful for Page Rank, but does it skew actual detection of influencers in exercise? info@rittmanmead.com www.rittmanmead.com @rittmanmead 24

Oracle Flat File Format Vertices and Edge Files Vertex File (.opv) Edge File (.ope) Unique ID for the vertex Property name ( name ) Property value datatype (1 = String) Property value ( markrittman ) Unique ID for the edge Leading edge vertex ID Trailing edge vertex ID Edge Type ( mentions ) Edge Property ( weight ) Edge Property datatype and value info@rittmanmead.com www.rittmanmead.com @rittmanmead 25

Loading Edges and Vertices into HBase cfg = GraphConfigBuilder.forPropertyGraphHbase() \.setname("connectionshbase") \.setzkquorum("bigdatalite").setzkclientport(2181) \.setzksessiontimeout(120000).setinitialedgenumregions(3) \.setinitialvertexnumregions(3).setsplitsperregion(1) \.addedgeproperty("weight", PropertyType.DOUBLE, "1000000") \.build(); opg = OraclePropertyGraph.getInstance(cfg); opg.clearrepository(); vfile="../../data/biwa_connections.opv" efile="../../data/biwa_connections.ope" opgdl=oraclepropertygraphdataloader.getinstance(); opgdl.loaddata(opg, vfile, efile, 2); // read through the vertices opg.getvertices(); // read through the edges opg.getedges(); Uses Gremlin Shell for HBase Creates connection to HBase Sets initial configuration for database Builds the database ready for load Defines location of Vertex and Edge files Creates instance of OraclePropertyGraphDataLoader Loads data from files Prepares the property graph for use Loads in Edges and Vertices Now ready for in-memory processing info@rittmanmead.com www.rittmanmead.com @rittmanmead 26

Calculating Most Influential Tweeters Using Page Rank voutput="/tmp/mygraph.opv" eoutput="/tmp/mygraph.ope" OraclePropertyGraphUtils.exportFlatFiles(opg, voutput, eoutput, 2, false); session = Pgx.createSession("session-id-1"); analyst = session.createanalyst(); graph = session.readgraphwithproperties(opg.getconfig()); rank = analyst.pagerank(graph, 0.001, 0.85, 100); rank.gettopkvalues(5); Initiates an in-memory analytics session Runs Page Rank algorithm to determine influencers Outputs top ten vertices (users) Top 10 vertices ==>PgxVertex with ID 1=0.13885623487462861 ==>PgxVertex with ID 3=0.08686102641801993 ==>PgxVertex with ID 101=0.06757752513733056 ==>PgxVertex with ID 6=0.06743774001139484 ==>PgxVertex with ID 37=0.0481517609757462 ==>PgxVertex with ID 17=0.042234536894569276 ==>PgxVertex with ID 29=0.04109794527311113 ==>PgxVertex with ID 65=0.032058649698044187 ==>PgxVertex with ID 15=0.023075360575195276 ==>PgxVertex with ID 93=0.019265959946506813 info@rittmanmead.com www.rittmanmead.com @rittmanmead 27

Calculating Most Influential Tweeters Using Page Rank v1=opg.getvertex(1l); v2=opg.getvertex(3l); v3=opg.getvertex(101l); \ v4=opg.getvertex(6l); v5=opg.getvertex(37l); v6=opg.getvertex(17l); \ v7=opg.getvertex(29l); v8=opg.getvertex(65l); v9=opg.getvertex(15l); \ v10=opg.getvertex(93l); System.out.println("Top 10 influencers: \n " + v1.getproperty("name") + \ "\n " + v2.getproperty("name") + \ "\n " + v3.getproperty("name") + \ "\n " + v4.getproperty("name") + \ "\n " + v5.getproperty("name") + \ "\n " + v6.getproperty("name") + \ "\n " + v7.getproperty("name") + \ "\n " + v8.getproperty("name") + \ "\n " + v9.getproperty("name") + \ "\n " + v10.getproperty("name")); Note : Over a 3-day period in May 2015 Twitter users referencing RM website + staff accounts Top 10 influencers: markrittman rmoff rittmanmead mrainey JeromeFr Nephentur borkur BIExperte i_m_dave dw_pete info@rittmanmead.com www.rittmanmead.com @rittmanmead 28

Visualising Property Graphs with Cityscape Open source graph analysis tool with Oracle Big Data Graph and Spatial Plug-in Available shortly from Oracle, connects to Oracle NoSQL or HBase and runs Page Rank etc Alternative to command-line for In-Memory Analytics once base graph created info@rittmanmead.com www.rittmanmead.com @rittmanmead 29

Calculating Top 10 Users using Page Rank Algorithm info@rittmanmead.com www.rittmanmead.com @rittmanmead 30

Visualising the Social Graph Around Particular Users info@rittmanmead.com www.rittmanmead.com @rittmanmead 31

Detecting Clusters (Communities) info@rittmanmead.com www.rittmanmead.com @rittmanmead 32

Calculating Shortest Path Between Users info@rittmanmead.com www.rittmanmead.com @rittmanmead 33

Conclusions, and Further Reading Tools such as OBIEE are great for understanding what (counts, page views, popular items) Oracle Big Data Discovery can be useful for understanding why? (sentiment, terms etc) Graph Analysis can help answer who? Who are our audience? What are our communities? Who are their important influencers? Oracle Big Data Graph and Spatial can answer these questions to big data scale Articles on the Rittman Mead Blog http://www.rittmanmead.com/category/oracle-big-data-appliance/ http://www.rittmanmead.com/category/big-data/ http://www.rittmanmead.com/category/oracle-big-data-discovery/ Rittman Mead offer consulting, training and managed services for Oracle Big Data http://www.rittmanmead.com/bigdata info@rittmanmead.com www.rittmanmead.com @rittmanmead 34

Oracle Big Data Spatial & Graph Social Network Analysis - Case Study Mark Rittman, CTO, Rittman Mead OTN EMEA Tour, May 2016 info@rittmanmead.com www.rittmanmead.com @rittmanmead