Linking Structured and Unstructured Data: Harnessing the Potential
Raj Nair




Linking Structured and Unstructured Data: Harnessing the Potential
Raj Nair

AGENDA
  Structured and Unstructured Data: What's the distinction?
  The rise of Unstructured Data: What's driving this?
  Big Data Use Cases
  Going about it
  Conclusion

Un-Structured Data
  Name:  Address:  Phone:
  Some level of organization
  Some associated metadata

And others: JPEG, DICOM, MPEG-2, binary formats

Structured Data

  Label    Type          Limit  Can be empty?
  Name     Alphabetic    100    No
  Address  AlphaNumeric  200    No
  Phone    Numeric       12     Yes

  High degree of organization
  All associated metadata
  Constraint definitions

AGENDA
  Structured and Unstructured Data: What's the distinction?
  The rise of Unstructured Data: What's driving this?
  Big Data Use Cases: What and Why?
  Going about it: Tools, Technologies and Architecture
  Conclusion

True or False?
  There is more structured data than there is unstructured data.
  There is no value associated with unstructured data.

Is there tangible business value?
  Monetization
  Optimization

AGENDA
  Structured and Unstructured Data: What's the distinction?
  The rise of Unstructured Data: What's driving this?
  Big Data Use Cases
  Going about it: Tools, Technologies and Architecture
  Conclusion

Use Cases
  Customer 360 views
  Patient Analysis: EHR data with clinical notes
  Customer Churn Management (Telco)
  Anomaly/Outlier Detection

Driven by questions
  Customer 360:
    What was the response to the last campaign? Why?
    What offers can we target to customers? When should we offer them?
    Is our brand messaging in line with what customers think about us?
  Patient Analysis:
    Identifying patient cohorts for a specific treatment:
    http://www.ncbi.nlm.nih.gov/pubmed/24384230

"It's impossible to piece together what happened without assessing all the pieces of why it happened. Take medication adherence, for example. We are talking about a $300 billion problem and possibly one of the leading causes of hospital readmission. If you look at only claims data, you are going to miss a key part of the picture, i.e. why the patient is not complying. Maybe he or she suffers from depression or knows English only as a secondary language. These are the critical factors, and the information is already there, but we need the ability to select and use it easily, to manage our populations correctly."
  Kyle Silvestro, CEO, SysTrue

So why aren't we doing this already?
  Technology limitations
  Cost of acquisition and processing
  Lack of awareness
  Privacy

AGENDA
  Structured and Unstructured Data: What's the distinction?
  The rise of Unstructured Data: What's driving this?
  Big Data Use Cases: What and Why?
  Going about it: Technology, Tools and Architecture
  Conclusion

[Architecture overview diagram: Data Ingestion, Data Integration, Data Storage, Data Processing, Analytics, Visualization, Data Distribution, Value Generation]

Ingesting Data
  Continuous streams: server logs, sensors, machine-generated data
  Large files: DICOM image files, documents
  Several hundreds of gigabytes a day, potentially
  Analyze in stream, or store, or BOTH

Apache Flume
  If you have data that streams in: instrumented machines, web servers, sensors, social media streams
  Apache Flume: a distributed system for collecting, moving, and aggregating streaming data
  Components: Agents host Sources, Channels, and Sinks
    Sources: receive data from an external source and write events to a channel
    Channels: temporarily hold or buffer events until they are consumed
    Sinks: the destination where events are finally written
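The source/channel/sink flow above can be sketched as a toy pipeline in plain Python. This only illustrates the roles of the three components; it is not the Flume API, and every class name here is invented for the sketch:

```python
from collections import deque

class Channel:
    """Buffers events until a sink consumes them (like Flume's memory channel)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
    def put(self, event):
        if len(self.events) >= self.capacity:
            raise RuntimeError("channel full")
        self.events.append(event)
    def take(self):
        return self.events.popleft() if self.events else None

class Source:
    """Receives data from an external feed and writes events to a channel."""
    def __init__(self, channel):
        self.channel = channel
    def receive(self, line):
        self.channel.put({"body": line})

class Sink:
    """Drains the channel and delivers events to a destination (here, a list)."""
    def __init__(self, channel, destination):
        self.channel = channel
        self.destination = destination
    def drain(self):
        while (event := self.channel.take()) is not None:
            self.destination.append(event["body"])

hdfs = []                        # stand-in for the real HDFS sink target
channel = Channel(capacity=10000)
Source(channel).receive("GET /products?cat=3")
Sink(channel, hdfs).drain()
print(hdfs)
```

The channel is the decoupling point: the source never blocks on the destination, it only blocks if the buffer fills.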

Flume Design
  Redirect logs to a remote host/port
  A Flume source converts messages to Flume Events
  Flume agents host the components through which events flow from an external source to the next destination
  Popular source types: Netcat, Syslog, Avro, exec

a1.sources = r1
a1.channels = channel1
a1.sinks = sink1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /<file>
a1.sources.r1.channels = channel1

a1.channels.channel1.type = memory
a1.channels.channel1.capacity = 10000
a1.channels.channel1.transactionCapacity = 1000

a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = hdfs://<path>/tmp/%y-%m-%d
a1.sinks.sink1.channel = channel1

Data Ingestion - Batch Copy
  Use Hadoop's built-in file system commands
  WebHDFS: HTTP REST access to HDFS
  HttpFS
  Data Integration Tools
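As a small illustration of the WebHDFS REST style, here is a helper that builds request URLs. The host name is a placeholder, 50070 is assumed as the classic NameNode HTTP port, and `op=OPEN`/`op=CREATE` are standard WebHDFS operations:

```python
def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS REST URL, e.g. op=OPEN to read a file, op=CREATE to write."""
    query = "&".join([f"op={op}"] + [f"{k}={v}" for k, v in params.items()])
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

url = webhdfs_url("namenode.example.com", "/input/customers", "OPEN")
print(url)
```

Any HTTP client can then issue the request, which is what makes WebHDFS convenient for batch copies from systems that cannot run Hadoop client libraries.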

Data Integration - RDBMS
  Apache Sqoop: import/export from an RDBMS
  Supports any JDBC-compliant database
  Native connectors for MySQL, PostgreSQL
  Can perform incremental and merge imports

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl \
  --username <username> --password <password> \
  --table CUSTOMERS -m 1 \
  --where "zipcode = 66213" \
  --target-dir /input/customers

Data Distribution - Kafka A publish-subscribe platform re-imagined as a distributed commit log
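A minimal sketch of the commit-log idea, assuming nothing about the real Kafka API: producers only ever append to an ordered log, and each consumer tracks its own offset, so independent consumers can read, and replay, the same records at their own pace:

```python
class Log:
    """Append-only log; each consumer tracks its own read offset."""
    def __init__(self):
        self.records = []
        self.offsets = {}
    def append(self, record):
        self.records.append(record)          # producers only ever append
    def poll(self, consumer):
        offset = self.offsets.get(consumer, 0)
        batch = self.records[offset:]        # everything this consumer hasn't seen
        self.offsets[consumer] = len(self.records)
        return batch

log = Log()
log.append("pageview:cat=3")
log.append("pageview:cat=7")
print(log.poll("spark-job"))    # both records
log.append("pageview:cat=3")
print(log.poll("spark-job"))    # only the new record
print(log.poll("hdfs-sink"))    # a new consumer replays everything
```

This is why the commit-log framing matters for distribution: adding a consumer (a new monitoring job, a new sink) costs nothing to the producers or to other consumers.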

Why that matters

Data Processing - Apache Pig
  ETL, data cleansing, data manipulation
  Two major components:
    A high-level language, Pig Latin
    A compiler that traditionally translated to MapReduce; can also run on Tez or Spark
  Data types, a data flow language, user-defined functions

Data Processing - Spark
  Build RDDs: the fundamental data model for Spark
  RDDs have actions: count, reduce, sample, loop, saveAs...
  RDDs can be transformed, giving you new RDDs: filters, unions, joins, intersections
  Has ML libraries
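The RDD model in miniature, using plain Python stand-ins rather than the Spark API: transformations yield new datasets and leave the old ones untouched, while actions collapse a dataset into a value:

```python
from functools import reduce

data = [1, 2, 3, 4, 5]

# transformations: each step yields a new dataset; the previous one is untouched
evens   = list(filter(lambda x: x % 2 == 0, data))   # like rdd.filter(...)
squared = list(map(lambda x: x * x, evens))          # like rdd.map(...)

# actions: collapse a dataset into a result
count = len(squared)                                 # like rdd.count()
total = reduce(lambda a, b: a + b, squared)          # like rdd.reduce(...)

print(evens, squared, count, total)
```

In Spark the transformations are additionally lazy and distributed across the cluster, but the functional shape of a job is the same chain of transformations ending in an action.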

Value Generation

Data Analysis - Apache Hive, Impala
  DW engine for Hadoop; Hive was originally built at Facebook
  Structured data with a SQL(ish) query language
  Great for ad hoc analysis over petabytes
  A tool for data analysts and data scientists

Word count in Hive:

CREATE TABLE words (line STRING);

LOAD DATA INPATH 'hdfs:///user/hive-wc/words.txt' OVERWRITE INTO TABLE words;

CREATE TABLE wc AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\\s')) AS word FROM words) w
GROUP BY word ORDER BY word;
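For comparison, the same word count outside Hive, as a small Python sketch: split each line on whitespace (the explode/split step), then group and count (the GROUP BY):

```python
from collections import Counter
import re

lines = ["hello big data", "big data big value"]   # stand-in for the words table
words = [w for line in lines for w in re.split(r"\s+", line)]  # explode(split(line, '\\s'))
wc = dict(sorted(Counter(words).items()))          # GROUP BY word ORDER BY word
print(wc)
```

The point of Hive is that this exact logic, written as SQL, runs unchanged over petabytes spread across the cluster.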

Export/Distribute
  To databases: an RDBMS as a backend to an application (Sqoop), NoSQL databases (connectors)
  Real-time monitoring
  Search UIs

Scalable Distributed Architecture for data ingestion, movement, and integration
[Diagram: Flume agents feed a Kafka cluster; Spark consumes from Kafka for real-time monitoring; data lands in the Hadoop cluster; databases connect via Sqoop]

Case Study 1: Twitter, Server Logs, and CRM
  Customer site visit interactions: web server / click stream (Apache Flume to stream data into HDFS)
  For those customers, get details: what products they use/subscribe to, status, from CRM or other databases (Apache Sqoop to pull data into HDFS)
  Do these customers talk about us? Twitter analysis, sentiment trends (Apache Flume to stream data into HDFS)
  What can we do for, or offer, these customers? (Apache Pig, Apache Hive, or other analysis engines)
    How can we satisfy our customers who are not happy with us?
    What can we offer customers who are our advocates?

Server logs to Kafka (agent a2):
a2.sources.tail-source.type = exec
a2.sources.tail-source.command = tail -F /var/log/httpd-access.log
a2.sources.tail-source.channels = memory-channel
a2.sinks.kafka.type = org.apache.flume.sink.kafka.KafkaSink

Twitter to Kafka (agent a1):
a1.sources.tw.type = com.cloudera.flume.source.TwitterSource
a1.sources.tw.channels = MemChannel
a1.sources.tw.consumerKey =
a1.sources.tw.consumerSecret =
a1.sources.tw.accessToken =
a1.sources.tw.accessTokenSecret =
a1.sources.tw.keywords = brand1, product1 ..
a1.sinks.kafka.type = org.apache.flume.sink.kafka.KafkaSink

Kafka to the Hadoop cluster (agent a3):
a3.sources.kafka.type = org.apache.flume.source.kafka.KafkaSource
a3.sources.kafka.channels = MemChannel
a3.sinks.hdfs.type = hdfs
a3.sinks.hdfs.hdfs.path = ..

Database to Hive (Sqoop):
sqoop import --connect jdbc:postgresql://pgs:5432/db_name \
  --username u1 --password pw1 \
  --table table_name --hive-import

Clean, Trim Server Logs

192.151.1.1 - - [09/May/2013:02:40:32 +0000] "GET /mysite/products/get-prod?cat=3 HTTP/1.1" 200 20274

all_logs = load 'access' using PigStorage(' ');
clean1 = foreach all_logs generate $0,
    REGEX_EXTRACT($3, '^\\[(.+)', 1),
    REGEX_EXTRACT($6, '(cat=\\d+)(.*)', 1);
-- (192.151.1.1, 09/May/2013:02:40:32, cat=3)
-- (192.151.1.1, 09/May/2013:02:40:32, )

clean2 = filter clean1 by $2 is not null;
clean3 = foreach clean2 generate $0 as id:chararray, $1 as date:chararray,
    REGEX_EXTRACT($2, '(\\d+)(.*)', 1) as product:int;
store clean3 into 'requests' using PigStorage('\t', '-schema');
-- (192.151.1.1, 09/May/2013:02:40:32, 3)
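The same clean-and-trim logic can be checked locally with Python's `re` module. This is a sketch over the sample log line from the slide; the field positions follow the space-delimited split, as in PigStorage(' '):

```python
import re

line = ('192.151.1.1 - - [09/May/2013:02:40:32 +0000] '
        '"GET /mysite/products/get-prod?cat=3 HTTP/1.1" 200 20274')
f = line.split(' ')                                # PigStorage(' ')

ip   = f[0]                                        # $0: client IP
date = re.search(r'^\[(.+)', f[3]).group(1)        # REGEX_EXTRACT($3, '^\\[(.+)', 1)
cat  = re.search(r'cat=(\d+)', f[6])               # the two-step category extract, combined
row  = (ip, date, int(cat.group(1))) if cat else None  # filter ... is not null

print(row)
```

Lines without a `cat=` parameter produce `None` here, mirroring the `filter ... is not null` step that drops them in Pig.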

Applying Schemas - Twitter

add jar json-serde-1.3-jar-with-dependencies.jar;

create table tweets (
  created_at string,
  id bigint,
  id_str string,
  text string,
  source string,
  truncated boolean,
  user struct <
    id: int, id_str: binary, name: string, screen_name: string,
    location: string, url: string, description: string,
    protected: boolean, verified: boolean,
    followers_count: int, friends_count: int, ... >,
  entities struct <
    hashtags: array<struct<text: string>>,
    media: array<struct<id: bigint, id_str: string, indices: array<int>, ... >>,
    urls: array<struct<url: string>>,
    user_mentions: array<struct<name: string, screen_name: string>> >,
  geo struct <coordinates: array<float>, type: string>,
  retweeted_status struct <
    created_at: string,
    entities: struct <hashtags: array<struct<text: string>>, url: string>, ... >,
  ...

Applying Schemas - Hive

CREATE EXTERNAL TABLE IF NOT EXISTS products (
  id INT, dateofreq STRING, product_cat INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/logs/pigoutput/';

  Split the date into YEAR, MONTH, etc. as needed
  Partition data as needed for query performance (say, by MONTH)
  JOIN and reconcile with the Twitter and Sqooped data

Connect Hive with ODBC clients
  Excel, Tableau, MicroStrategy, Pentaho BI, Talend
  Try out this Hortonworks tutorial:
  http://hortonworks.com/kb/how-to-connect-tableau-to-hortonworks-sandbox/

Case Study 2: Improved Patient Care - EMR, Clinical Notes, X-Rays
  Better identification of high-risk patients
    Focus on targeted care
    Reducing the rate of re-admission
  More effort in building data models
  Create recommenders
    E.g.: a matrix of patients and symptoms
    Recommend drugs when new patients enter the system
    Recommend care plans based on the history of similar patients
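The patient-by-symptom matrix idea can be sketched with cosine similarity: find the most similar existing patient and recommend from their care history. All names and data below are hypothetical toy values, not from the case study:

```python
from math import sqrt

# rows: patients, columns: symptom indicators (1 = present); toy data
patients = {
    "p1": [1, 1, 0, 0],
    "p2": [1, 0, 1, 0],
    "p3": [0, 0, 1, 1],
}
care_plans = {"p1": "plan-A", "p2": "plan-B", "p3": "plan-C"}

def cosine(a, b):
    """Cosine similarity between two symptom vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def recommend(new_patient):
    """Recommend the care plan of the most similar existing patient."""
    best = max(patients, key=lambda p: cosine(patients[p], new_patient))
    return care_plans[best]

print(recommend([1, 1, 0, 0]))
```

A production recommender would use far richer features and a library-backed model, but the core "nearest patient cohort" step is this similarity lookup.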

{
  "code": "109054",
  "display": "Patient State",
  "definition": "A description of the physiological condition of the patient"
},
{
  "code": "109121",
  "display": "On discharge",
  "definition": "The occasion on which procedure was performed on discharge from hospital as an in-patient."
},
{
  "code": "110110",
  "display": "Patient Record",
  "definition": "Audit event: Patient Record created, read, updated, or deleted"
},
...

EHR Data
  EHR: PatientInfo, Demographics, Allergies, FamilyHistory, CarePlan, Procedures
  (each tracked across revisions: Revision 1, Revision 2, Revision 3)

Link, Merge, Join
  Generate appropriate keys; utilize existing keys as needed
  Overlay appropriate schemas
  Aim to de-normalize; join as necessary
  Iterate; visualize often
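A miniature of the key-generation and join steps above, with field names invented for illustration: derive a shared key from each source, then denormalize by joining the two sources on it:

```python
# two sources that lack a shared primary key (all records are toy data)
ehr    = [{"mrn": "001", "name": "Jane Doe", "dob": "1980-01-02"}]
claims = [{"last": "DOE", "first": "JANE", "dob": "1980-01-02", "claim": 412.50}]

def key_from_ehr(r):
    # generated key: (UPPERCASED last name, date of birth)
    return (r["name"].split()[-1].upper(), r["dob"])

def key_from_claims(r):
    return (r["last"].upper(), r["dob"])

# denormalize: index one source by the generated key, join the other onto it
index = {key_from_ehr(r): r for r in ehr}
joined = []
for c in claims:
    k = key_from_claims(c)
    if k in index:
        joined.append({**index[k], **c})   # merged, de-normalized record

print(joined)
```

At Hadoop scale this same shape becomes a Pig or Hive JOIN on the generated key columns; the hard part, as the slide says, is iterating on the key definitions until the match rate is acceptable.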

AGENDA
  Structured and Unstructured Data: What's the distinction?
  The rise of Unstructured Data: What's driving this?
  Big Data Use Cases: What and Why?
  Going about it: Technology, Tools and Architecture
  Conclusion

Conclusion
  Unstructured data helps fill in the gaps
  Unstructured data adds deeper context
  Combined with structured data, it can generate tangible business value
  Data architecture needs to be viewed with a new lens