Linking Structured and Unstructured Data: Harnessing the Potential
Raj Nair
AGENDA
- Structured and Unstructured Data: What's the distinction?
- The rise of Unstructured Data: What's driving this?
- Big Data Use Cases: What and Why?
- Going about it: Tools, Technologies and Architecture
- Conclusion
Unstructured Data
Name: / Address: / Phone:
- Some level of organization
- Some associated metadata
And others: binary formats such as JPEG, DICOM, MPEG-2
Structured Data

Label   | Type         | Limit | Can be empty?
--------|--------------|-------|--------------
Name    | Alphabetic   | 100   | No
Address | AlphaNumeric | 200   | No
Phone   | Numeric      | 12    | Yes

- High degree of organization
- All associated metadata
- Constraint definitions
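For contrast, a minimal sketch of how these constraint definitions look as a relational schema (table name and types are illustrative, not from the slides):

-- illustrative only: the slide's constraints expressed as SQL DDL
CREATE TABLE contacts (
    name    VARCHAR(100) NOT NULL,  -- Alphabetic, limit 100, cannot be empty
    address VARCHAR(200) NOT NULL,  -- AlphaNumeric, limit 200, cannot be empty
    phone   VARCHAR(12)             -- Numeric, limit 12, may be empty
);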
AGENDA
- Structured and Unstructured Data: What's the distinction?
- The rise of Unstructured Data: What's driving this?
- Big Data Use Cases: What and Why?
- Going about it: Tools, Technologies and Architecture
- Conclusion
True or False?
- There is more structured data than there is unstructured data
- There is no value associated with unstructured data
Is there tangible business value?
- Monetization
- Optimization
AGENDA
- Structured and Unstructured Data: What's the distinction?
- The rise of Unstructured Data: What's driving this?
- Big Data Use Cases: What and Why?
- Going about it: Tools, Technologies and Architecture
- Conclusion
Use Cases
- Customer 360 views
- Patient Analysis: EHR data with clinical notes
- Customer Churn Management (Telco)
- Anomaly/Outlier Detection
Driven by questions
Customer 360:
- What was the response to the last campaign? Why?
- What offers can we target to customers? When should we offer them?
- Is our brand messaging in line with what customers think about us?
Patient Analysis:
- Identifying patient cohorts for a specific treatment: http://www.ncbi.nlm.nih.gov/pubmed/24384230
"It's impossible to piece together what happened without assessing all the pieces of why it happened. Take medication adherence, for example. We are talking about a $300 billion problem and possibly one of the leading causes of hospital readmission. If you look at only claims data, you are going to miss a key part of the picture, i.e. why the patient is not complying. Maybe he or she suffers from depression or knows English only as a secondary language. These are the critical factors, and the information is already there, but we need the ability to select and use it easily, to manage our populations correctly." - Kyle Silvestro, CEO, SysTrue
So why aren't we doing this already?
- Technology limitations
- Cost of acquisition and processing
- Lack of awareness
- Privacy
AGENDA
- Structured and Unstructured Data: What's the distinction?
- The rise of Unstructured Data: What's driving this?
- Big Data Use Cases: What and Why?
- Going about it: Tools, Technologies and Architecture
- Conclusion
[Overview diagram: the pipeline stages covered next]
- Data Ingestion
- Data Distribution
- Data Storage
- Data Integration
- Data Processing
- Analytics / Visualization
- Value Generation
Ingesting Data
- Continuous streams: server logs, sensors, machine-generated data
- Large files: DICOM image files, documents
- Potentially several hundred gigabytes a day
- Analyze in stream, or store, or BOTH
Apache Flume
If you have data that streams in: instrumented machines, web servers, sensors, social media streams.
Apache Flume is a distributed system for collecting, moving and aggregating streaming data.
Components: Agents host Sources, Channels and Sinks
- Source: receives data from an external source and writes events to a channel
- Channel: temporary hold or buffer for events until they are consumed
- Sink: destination where events are finally written
Flume Design
- Redirect logs to a remote host/port
- A Flume source converts messages to Flume Events
- A Flume agent hosts the components through which events flow from an external source to the next destination
- Popular source types: netcat, syslog, Avro, exec

a1.sources = r1
a1.channels = channel1
a1.sinks = sink1

# exec source: tail a log file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /<file>
a1.sources.r1.channels = channel1

# in-memory channel buffering events
a1.channels.channel1.type = memory
a1.channels.channel1.capacity = 10000
a1.channels.channel1.transactionCapacity = 1000

# HDFS sink
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = hdfs://<path>/tmp/%y-%m-%d
a1.sinks.sink1.channel = channel1
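An agent with this configuration would typically be started with the flume-ng launcher (the config file name here is an assumption):

# start agent a1 with the configuration above
flume-ng agent --name a1 --conf conf --conf-file conf/tail-to-hdfs.conf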
Data Ingestion - Batch Copy
- Use Hadoop built-in file system commands
- WebHDFS: HTTP REST access to HDFS
- HttpFS
- Data Integration Tools
(Examples below.)
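A sketch of both options; hostnames, the default NameNode HTTP port (50070) and paths are placeholders:

# copy a local file into HDFS with the built-in commands
hadoop fs -put /data/batch/customers.csv /input/customers/

# the same upload over the WebHDFS REST API (two steps: create, then follow the redirect)
curl -i -X PUT "http://<namenode>:50070/webhdfs/v1/input/customers/customers.csv?op=CREATE"
curl -i -X PUT -T /data/batch/customers.csv "<redirect-location-returned-above>"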
Data Integration - RDBMS
Apache Sqoop
- Import/Export from RDBMS
- Supports any JDBC-compliant database
- Native connectors for MySQL, PostgreSQL
- Can perform incremental and merge imports (see the sketch after this example)

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl \
  --username <username> --password <password> \
  --table CUSTOMERS -m 1 \
  --where "zipcode = '66213'" \
  --target-dir /input/customers
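As a sketch of the incremental mode mentioned above (the check column and last value are assumptions), a follow-up run pulls only rows added since the previous import:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl \
  --username <username> --password <password> \
  --table CUSTOMERS \
  --incremental append --check-column customer_id --last-value 10000 \
  --target-dir /input/customers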
Data Distribution - Kafka
A publish-subscribe platform re-imagined as a distributed commit log.
Why that matters: producers and consumers are fully decoupled; the log is persisted and replicated, so consumers can replay messages from any offset; partitioned topics scale horizontally.
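A quick way to see the publish-subscribe model in action is with Kafka's bundled console clients (broker/ZooKeeper addresses and the topic name are assumptions; newer Kafka versions use --bootstrap-server on the consumer instead of --zookeeper):

# publish click events to a topic
kafka-console-producer.sh --broker-list localhost:9092 --topic clickstream

# independently, any number of consumers can read the same topic from the beginning
kafka-console-consumer.sh --zookeeper localhost:2181 --topic clickstream --from-beginning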
Data Processing - Apache Pig
- ETL, data cleansing, data manipulation
- Two major components: the high-level language Pig Latin, and a compiler that translates it to MapReduce jobs (can also run on Tez or Spark)
- Data types, data flow language, user-defined functions (a small example follows)
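A minimal Pig Latin sketch of the data flow style (input and output paths are assumptions):

-- load raw lines, split into words, count occurrences
lines  = LOAD '/input/sample.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/output/wordcount';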
Data Processing - Spark
- Build RDDs: the fundamental data model for Spark
- RDDs have actions: count, reduce, takeSample, foreach, saveAsTextFile
- RDDs can be transformed, giving you new RDDs: filter, union, join, intersection
- Has ML libraries (a sketch follows)
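A minimal PySpark sketch of the transform-then-act pattern (paths and the error marker are assumptions):

# build an RDD from a file in HDFS, transform it, then run actions
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")
logs   = sc.textFile("hdfs:///input/weblogs")          # base RDD
errors = logs.filter(lambda line: " 500 " in line)     # transformation -> new RDD
print(errors.count())                                  # action
errors.saveAsTextFile("hdfs:///output/errors")         # action
sc.stop()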
Value Generation
Data Analysis - Apache Hive, Impala
- DW engines for Hadoop; Hive was originally built at Facebook
- Structured data with a SQL(ish) query language
- Great for ad hoc analysis over petabytes
- Tools for data analysts, data scientists

Word count in Hive:

CREATE TABLE words (line STRING);
LOAD DATA INPATH 'hdfs:///user/hive-wc/words.txt' OVERWRITE INTO TABLE words;
CREATE TABLE wc AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\\s')) AS word FROM words) w
GROUP BY word ORDER BY word;
Export/Distribute to Databases
- RDBMS as a backend to an application: Sqoop
- NoSQL databases (connectors)
- Real-time monitoring
- Search UIs
(An export sketch follows.)
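A sketch of pushing results back out with Sqoop export (connection string, table and directory are assumptions):

sqoop export --connect jdbc:mysql://dbhost:3306/appdb \
  --username <username> --password <password> \
  --table customer_offers \
  --export-dir /user/hive/warehouse/customer_offers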
Scalable Distributed Architecture for data ingestion, movement and integration
[Diagram: Flume agents feed a Kafka cluster; Kafka feeds Spark (real-time monitoring) and the Hadoop cluster; Sqoop moves data between the databases and the Hadoop cluster]
Case Study 1: Twitter, Server Logs and CRM
- Customer site visit interactions: web server / clickstream (Apache Flume to stream data into HDFS)
- For those customers, get details: what products they use/subscribe to, status; from CRM or other databases (Apache Sqoop to pull data into HDFS)
- Do these customers talk about us? Twitter analysis, sentiment trends (Apache Flume to stream data into HDFS)
- What can we do for/offer these customers? (Apache Pig, Apache Hive or other analysis engines)
  - How can we satisfy our customers who are not happy with us?
  - What can we offer customers who are our advocates?
[Diagram: server logs and Twitter flow through Flume into the Kafka cluster; Kafka feeds the Hadoop cluster; the DB is imported with Sqoop]

# Flume agent a2: tail the web server log, publish to Kafka
a2.sources.tail-source.type = exec
a2.sources.tail-source.command = tail -F /var/log/httpd-access.log
a2.sources.tail-source.channels = memory-channel
a2.sinks.kafka.type = org.apache.flume.sink.kafka.KafkaSink

# Flume agent a1: Twitter source, filtered by keywords, publishing to Kafka
a1.sources.tw.type = com.cloudera.flume.source.TwitterSource
a1.sources.tw.channels = MemChannel
a1.sources.tw.consumerKey =
a1.sources.tw.consumerSecret =
a1.sources.tw.accessToken =
a1.sources.tw.accessTokenSecret =
a1.sources.tw.keywords = brand1, product1..
a1.sinks.kafka.type = org.apache.flume.sink.kafka.KafkaSink

# Flume agent a3: consume from Kafka, write to HDFS
a3.sources.kafka.type = org.apache.flume.source.kafka.KafkaSource
a3.sources.kafka.channels = MemChannel
a3.sinks.hdfs.type = hdfs
a3.sinks.hdfs.hdfs.path = ..

# Sqoop: pull CRM data into Hive
sqoop import --connect jdbc:postgresql://pgs:5432/db_name \
  --username u1 --password pw1 --table table_name --hive-import
Clean, Trim Server Logs

Sample input line:
192.151.1.1 - - [09/May/2013:02:40:32 +0000] "GET /mysite/products/get-prod?cat=3 HTTP/1.1" 200 20274

all_logs = load 'access' using PigStorage(' ');
clean1 = foreach all_logs generate $0, REGEX_EXTRACT($3,'^\\[(.+)',1), REGEX_EXTRACT($6,'.*?(cat=\\d+)(.*)',1);
-- (192.151.1.1, 09/May/2013:02:40:32, cat=3)
-- (192.151.1.1, 09/May/2013:02:40:32, )
clean2 = filter clean1 by $2 is not null;
clean3 = foreach clean2 generate $0 as id:chararray, $1 as date:chararray, (int)REGEX_EXTRACT($2,'cat=(\\d+)',1) as product:int;
-- (192.151.1.1, 09/May/2013:02:40:32, 3)
store clean3 into 'requests' using PigStorage('\t', '-schema');
Applying Schemas - Twitter

add jar json-serde-1.3-jar-with-dependencies.jar;

create table tweets (
  created_at string,
  id bigint,
  id_str string,
  text string,
  source string,
  truncated boolean,
  user struct<
    id: int, id_str: string, name: string, screen_name: string,
    location: string, url: string, description: string,
    protected: boolean, verified: boolean,
    followers_count: int, friends_count: int, ...>,
  entities struct<
    hashtags: array<struct<text: string>>,
    media: array<struct<id: bigint, id_str: string, indices: array<int>>>,
    urls: array<struct<url: string>>,
    user_mentions: array<struct<name: string, screen_name: string>>>,
  geo struct<coordinates: array<float>, type: string>,
  retweeted_status struct<
    created_at: string,
    entities: struct<hashtags: array<struct<text: string>>, url: string>>,
  ...
Applying Schemas - Hive

CREATE EXTERNAL TABLE IF NOT EXISTS products (id INT, dateofreq STRING, product_cat INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/logs/pigoutput/';

- Split date into YEAR, MONTH etc. as needed
- Partition data as needed for query performance (say, by MONTH)
- JOIN and reconcile with Twitter and Sqooped data (a sketch follows)
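A sketch of the partitioning and joining steps; the partitioned layout, the product_names lookup table and its columns are assumptions for illustration:

-- partition the Pig output by month for faster queries
CREATE EXTERNAL TABLE IF NOT EXISTS products_by_month (id INT, dateofreq STRING, product_cat INT)
PARTITIONED BY (month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- hypothetical reconciliation with the tweets table: hashtag mentions per product,
-- assuming a small product_names lookup (label -> product_cat)
SELECT n.product_cat, COUNT(*) AS mentions
FROM (
  SELECT lower(tag.text) AS hashtag
  FROM tweets LATERAL VIEW explode(entities.hashtags) h AS tag
) tw
JOIN product_names n ON (tw.hashtag = lower(n.label))
GROUP BY n.product_cat;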
Connect Hive with ODBC clients
- Excel
- Tableau
- MicroStrategy
- Pentaho BI
- Talend
Try out this Hortonworks tutorial: http://hortonworks.com/kb/how-to-connect-tableau-to-hortonworks-sandbox/
Case Study 2: Improved Patient Care - EMR, Clinical Notes, X-Rays
- Better identification of high-risk patients
- Focus on targeted care
- Reducing the rate of re-admission
- More effort in building data models
- Create recommenders, e.g. a matrix of patients and symptoms:
  - Recommend drugs when new patients enter the system
  - Recommend care plans based on the history of similar patients
(A recommender sketch follows.)
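One way to build such a recommender is collaborative filtering, sketched here with Spark MLlib's ALS over an implicit patient-by-treatment matrix; the input file, its format and the IDs are assumptions (recommendProducts requires Spark 1.4+):

# minimal sketch: implicit-feedback ALS over a patient x treatment matrix
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="care-recommender")

# assumed input lines: patient_id,treatment_id,strength (e.g. counts from EHR history)
data = sc.textFile("hdfs:///ehr/patient_treatments.csv")
ratings = data.map(lambda l: l.split(',')).map(
    lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

model = ALS.trainImplicit(ratings, rank=10, iterations=10)

# top 5 suggested treatments for a (hypothetical) patient id 42
print(model.recommendProducts(42, 5))
sc.stop()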
{ "code": "109054", "display": "Patient State", "definition": "A description of the physiological condition of the patient }, { "code": "109121", "display": "On discharge", "definition": "The occasion on which procedure was performed on discharge from hospital as an in-patient. }, { "code": "110110", "display": "Patient Record", "definition": "Audit event: Patient Record created, read, updated, or deleted" },...
EHR Data
[Diagram: an EHR record containing PatientInfo, Demographics, Allergies, FamilyHistory, CarePlan and Procedures; CarePlan and Procedures each carry multiple revisions (Revision 1-3)]
Link, Merge, Join
- Generate appropriate keys; utilize existing keys as needed
- Overlay appropriate schemas
- Aim to de-normalize; join as necessary
- Iterate, visualize often
(A linking sketch follows.)
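A sketch of the link-and-denormalize step in HiveQL; every table and column name here is hypothetical:

-- hypothetical: one denormalized row per patient, linking structured EHR fields to note text
CREATE TABLE patient_360 AS
SELECT p.patient_id, p.demographics, c.plan_text, n.note_text
FROM patient_info p
LEFT OUTER JOIN care_plans c
  ON (p.patient_id = c.patient_id AND c.revision = 3)   -- latest revision, per the diagram
LEFT OUTER JOIN clinical_notes n
  ON (p.patient_id = n.patient_id);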
AGENDA
- Structured and Unstructured Data: What's the distinction?
- The rise of Unstructured Data: What's driving this?
- Big Data Use Cases: What and Why?
- Going about it: Tools, Technologies and Architecture
- Conclusion
Conclusion
- Unstructured data helps fill in the gaps
- Unstructured data adds deeper context
- Combined with structured data, it can generate tangible business value
- Data architecture needs to be viewed through a new lens