Big Data Analytics Opportunities and Challenges




Big Data Analytics: Opportunities and Challenges
Anup Kumar, Professor and Director of MINDS Lab, Computer Engineering and Computer Science Department, University of Louisville

Road Map
- Introduction
- Hadoop Ecosystem
- Data Analytics Approaches
- Processing of Structured Data
- Processing of Unstructured Data
- Example Healthcare Applications
- Concluding Observations

Big Data Applications
- Advertising and marketing: customer shopping patterns; response to promotional campaigns
- Manufacturing: maintenance of machine health
- Social media: browsing and sentiment analysis; impact on buying patterns
- Government data: efficient process management

Big Data Applications (cont'd)
- Stock market: stock performance prediction
- Healthcare management: patient health monitoring; impact of preventive care
- Financial institutions: fraud detection and mitigation
- Weather prediction: impact analysis and better disaster management

Big Data Analytics: Benefits
- Saving money: fraud detection in the medical industry
- Risk management: lower cost and better outcomes
- Real-time decision making: sensor data analysis; influence of social sentiment on patient health
- Getting to know your customer better: targeted advertisement; better recommendations

Components of Big Data
- Volume: data generated by 2020 will be measured in zettabytes (10^21 bytes); soon after, the measures will be yottabytes (10^24) and brontobytes (10^27)
- Variety: structured (transactional) data; unstructured (image, video, and text) data
- Velocity: rate of data generation; growth in the ability to process data
- Variability/Veracity: trustworthiness and quality of the data

Key Components of Big Data Analytics
- Understanding advanced analytics: association analysis; clustering, classification, regression; recommendation frameworks; time series analysis
- Handling volume and velocity of data: MapReduce framework; cloud computing frameworks; custom frameworks
- Dealing with variety of data: structured data analytics (advanced analytics); unstructured data analytics (text processing)

Data Analysis Options
- Descriptive analytics: specifies the characteristics of the data. How do we describe the system? What happened in the system, and when? What are the parameters of the system? What is the impact of a parameter on the system? Is there any correlation between the parameters?
- Predictive analytics: uses data mining and predictive modeling. It can answer questions such as: What are the future trends? What decision does past history support? What happens under a what-if analysis?

Handling Large Volume
- Hadoop-based implementation of analytics
- In-database analytics

Hadoop-Based Analytics
- Allows analytics to be carried out on any type of data store
- Provides a standard framework for computation, in a cloud environment or on an on-premises network
- Advantages: vendor independent; allows any type of analytics to be carried out
- Disadvantages: complex implementation

In-Database Analytics
- Allows analytic computation to be carried out in the database, using SQL and SQL extensions
- Advantages: higher analytics efficiency, easy usability, better database manageability; centralizing analytics can simplify security, data, and version management
- Disadvantages: vendor dependent; cost

Handling Variety of Data
- Structured data analytics: association analysis; clustering; classification; regression; recommendation frameworks; time series analysis
- Unstructured data analytics: association analysis; clustering; classification; text processing

Road Map
- Introduction
- Hadoop Ecosystem
- Data Analytics Approaches
- Processing of Structured Data
- Processing of Unstructured Data
- Example Healthcare Applications
- Concluding Observations

Introduction to Hadoop
- Open source from Apache; free download from http://hadoop.apache.org
- Runs on commodity hardware, lowering procurement, licensing, and operational costs
- Scalable: incrementally add hardware nodes as needed, accommodating growth in data and processing requirements
- Fault tolerant: automatic data replication and task reallocation

Core Components of the Hadoop Big Data Architecture
- Data processing: MapReduce
- Data storage: HDFS
- Surrounding layers for integration, security, and operations

Hadoop Distributed File System (HDFS)
- Designed for large-scale data processing; allows storage of large data files
- Runs on top of the native file system on each node
- Data is stored in units known as blocks: 64 MB by default, though often configured larger
- Large files are stored as many blocks across many nodes
- Blocks are replicated to multiple nodes
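A back-of-the-envelope sketch of the block layout described above (illustrative pure Python, not HDFS code; the 64 MB figure is the default cited on the slide):

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the HDFS default mentioned on the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file of `file_size` bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

# A 200 MB file becomes three full 64 MB blocks plus one 8 MB tail block,
# and each of these blocks would be replicated to multiple DataNodes.
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks))  # 4
```

With the default replication factor of 3, those four blocks would consume three times the file's size across the cluster.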

[Diagram: HDFS blocks and replication. A file is split into Blocks 1-3; each block is replicated across several of Nodes 1-3.]

Hadoop Node Types in a Cluster
Nodes in Hadoop play different roles:
- HDFS nodes: NameNode, DataNodes
- MapReduce nodes: JobTracker, TaskTrackers

[Diagram: Hadoop node interaction. A client starts a job on the cluster; on the MapReduce side the JobTracker coordinates TaskTrackers, while on the HDFS side the NameNode coordinates DataNodes.]

Role of the JobTracker
- Receives jobs from clients and manages them; the master in a master/slave architecture with TaskTrackers
- Determines the job execution plan and assigns nodes to the different processing tasks
- Monitors tasks as they execute
- Detects failures and restarts or re-runs tasks

Role of the TaskTracker
- Manages the individual tasks (map or reduce) that the JobTracker issues
- A single TaskTracker can run many map or reduce tasks in parallel
- Keeps the JobTracker updated on task progress via a heartbeat signal
- Failure to send a heartbeat results in the JobTracker resubmitting the task to another TaskTracker node

Role of the NameNode
- HDFS is also a master/slave architecture: the NameNode is the HDFS master directing the DataNode slaves, which perform the low-level input/output where data is actually stored
- Keeps track of files: how they are broken into blocks, and which nodes store each block
- Only one per Hadoop cluster! It is a single point of failure, so it should not run on unreliable commodity hardware

Role of the DataNode
- Performs the low-level work of reading and writing HDFS blocks to the local file system
- Communicates with the NameNode (reporting which blocks it stores) and with other DataNodes (to replicate data blocks)
- Many per Hadoop cluster, normally run on inexpensive commodity hardware

Hadoop Execution Modes
Hadoop can be configured to run in one of three modes:
- Local (standalone) mode
- Pseudo-distributed mode
- Fully distributed mode
Three key configuration files define which mode to run in: core-site.xml, hdfs-site.xml, mapred-site.xml

Local (Standalone) Mode
- The default mode for Hadoop; no assumptions are made about the hardware
- Does not use HDFS
- Used for developing and debugging the application logic of Hadoop programs

Pseudo-Distributed Mode
- Hadoop runs in clustered mode, with a cluster size of one
- Useful for developing and debugging Hadoop applications; enables examination of memory and HDFS usage
- The configuration files set the port numbers used by the NameNode and JobTracker

Fully Distributed Mode
- The mode for running Hadoop applications in production
- Configuration requires setting up master nodes (hosting the NameNode and JobTracker components) and slave nodes (machines running the DataNode and TaskTracker components)
- The NameNode serves an HDFS status report on port 50070, enabling a check of each DataNode in the cluster
- The JobTracker serves a status report on ongoing MapReduce jobs on port 50030

Introduction to Pig
- Hadoop is powerful for processing large datasets but requires significant programming skill to develop applications
- Pig is a Hadoop extension that simplifies Hadoop programming: it is easy to use and highly scalable
- Pig has two major components: a data processing language called Pig Latin, and a compiler that translates Pig Latin into Hadoop (MapReduce) programs

Introduction to Hive
- A data warehousing package built on top of Hadoop, originally developed at Facebook
- The target audience for Hive is data analysts comfortable with SQL
- Provides ad hoc queries and data analysis on large-scale datasets
- Focuses on structured data, which enables optimizations to be performed
- Provides an SQL-like language, HiveQL, based on tables, rows, and columns; no knowledge of Hadoop programming is needed

Working With Hive
- Hive requires a metastore service, which provides the schemas that structure the data and helps with efficient querying and processing
- The metastore is implemented using tables in a relational database; by default, Hive uses Java's built-in Derby database
- Hive can be accessed from the command-line interface or from a web interface known as the Hive Web Interface

Introduction to Apache Sentry
- Secure authorization: the ability to control and enforce access to data by granting privileges to authenticated users
- Fine-grained access control at the database, table, and view level; as an example, one group can see all data while another sees only non-sensitive data filtered through a view
- Sentry provides a unified security platform: existing Hadoop Kerberos security handles authentication, and Sentry access policies apply to all applications, e.g. Pig, Hive, etc.

Introduction to Oozie
- Oozie is a Hadoop workflow engine that manages data processing activities; available from http://oozie.apache.org
- Simplifies the management of recurring and data-driven jobs, and the re-execution of failed jobs
- Comprises a workflow engine, which can execute different types of jobs, and a coordinator engine

Oozie Workflows
- Oozie workflows are directed acyclic graphs (DAGs)
- Graph nodes are either action nodes, which perform a workflow task, or control-flow nodes, which determine the logic between tasks
- There is always exactly one start node and at least one terminal node
[Diagram: a start node fans out to action nodes, which lead to terminal nodes.]

[Diagram: Pig, Hive, and Impala in the big data architecture. Pig, Hive, Impala, and Mahout sit on top of MapReduce and HDFS; Sqoop and Flume handle integration, Sentry handles security, and Oozie handles operations.]

Road Map
- Introduction
- Hadoop Ecosystem
- Data Analytics Approaches
- Processing of Structured Data
- Processing of Unstructured Data
- Example Healthcare Applications
- Concluding Observations

Hadoop-Based Analytics
- Allows analytics to be carried out on any type of data store
- Provides a standard framework for computation, in a cloud environment or on an on-premises network
- Advantages: vendor independent; allows any type of analytics to be carried out; provides a flexible and adaptable architecture for implementation
- Disadvantages: complex implementation

MapReduce Features
- Allows the use of large computational resources; inter-cluster communication is managed by MapReduce
- The MapReduce architecture supports data and task distribution, fault monitoring, and task and data replication
- Simple programming model
- Limitation: MapReduce cannot solve every problem

MapReduce: A Pragmatic Approach
- It can solve many big data problems: data filtering; statistics and aggregation; graph analytics; decision trees and classification; clustering and recommendations
- Practical distributed API, easier to understand and use
- Higher-level APIs exist to reduce the complexity of programming
- Ability to schedule multi-stage jobs

Phases in MapReduce Processing
Data processing with MapReduce goes through three phases:
- Map phase: processes the input data and generates <key, value> pairs
- Shuffle phase: moves the <key, value> pairs to the appropriate processing node for reduction
- Reduce phase: processes the grouped <key, value> pairs to generate the final output
Hadoop can use multiple machines for each phase
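The three phases can be sketched in a few lines of plain Python (a single-machine illustration of the concept, not Hadoop API code; in a real cluster each phase runs across many machines):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a <key, value> pair for every word in every input record.
    return [(word, 1) for line in records for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group all values for the same key together
    # (Hadoop routes each key's pairs to the reducer responsible for it).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into the final output.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(["big data", "big deal"])))
print(counts["big"])  # 2
```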

Association Rule Example
To compute support, the number of times each product and each combination of products occurs in the data has to be counted. The original transaction file:
1, milk (M), bread (B)
2, bread (B), butter (T)
3, milk (M), bread (B), butter (T)
4, milk (M)
Expanding each transaction into all of its item combinations, the input data file becomes:
M B MB
B T BT
M B T MB MT BT MBT
M

Association Rule MapReduce
- Input/Splitting: the four expanded transactions above, one per split
- Mapping: each mapper emits a <combination, 1> pair, e.g. M,1 B,1 MB,1 for the first transaction
- Shuffling: pairs with the same key are grouped together
- Reducing: the counts are summed, yielding M,3 B,3 T,2 MB,2 BT,2 MT,1 MBT,1
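The same job can be sketched in pure Python (an illustration of the counting logic, not Hadoop code): each "mapper" expands a transaction into all of its item combinations, and the "reducer" sums the emitted ones.

```python
from itertools import combinations
from collections import Counter

# The slide's transactions: milk (M), bread (B), butter (T).
transactions = [["M", "B"], ["B", "T"], ["M", "B", "T"], ["M"]]

def mapper(items):
    # Emit <combination, 1> for every non-empty combination of the items,
    # preserving the M-B-T order used on the slide (M, B, MB, ...).
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            yield "".join(combo), 1

support = Counter()  # the "reduce" step: sum the 1s per key
for t in transactions:
    for key, one in mapper(t):
        support[key] += one

print(support["MB"])  # 2: milk and bread were bought together twice
```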

Road Map
- Introduction
- Hadoop Ecosystem
- Data Analytics Approaches
- Processing of Structured Data
- Processing of Unstructured Data
- Example Healthcare Applications
- Concluding Observations

Steps in Structured Data Analytics
Step 1: Business domain analysis
Step 2: Data exploration and investigation
Step 3: Data preparation and cleaning
Step 4: Model design and development
Step 5: Model verification and testing
Step 6: Analysis of the output

Descriptive and Predictive Analytics
- Descriptive analytics provides knowledge discovered from the data set; examples include clustering, correlation analysis, pattern discovery, and association rules
- Predictive analytics builds a model to predict future behavior; examples include regression models and time series analysis

Introduction to Mahout
- An open-source machine learning library that leverages the power of Hadoop MapReduce
- Currently available algorithms include: various clustering algorithms; singular value decomposition; parallel frequent pattern mining; the naive Bayes classifier; the random forest decision-tree-based classifier; collaborative filtering (user- and item-based recommenders); and others

[Diagram: Mahout architecture. Applications call the Mahout machine learning library (clustering, classification, recommendation, filtering, etc.), which runs on Hadoop/MapReduce over a cluster of combined storage and compute nodes.]

Interacting With the Mahout Machine Learning Library
- Command-line interaction: executes machine learning algorithms using Mahout commands. Pros: the algorithms are easy to use. Cons: a large number of options must be specified, and only a fixed set of options is available
- API-based interaction: executes machine learning algorithms from source code. Pros: more flexibility in using the algorithms. Cons: requires programming knowledge

Clustering: Introduction
- Clustering is the grouping of available data based on similarity under some pre-specified criteria
- Uses unsupervised learning techniques and statistical methods
- A way of looking for interesting patterns or structure in the data
- Partitions the data into different groups and identifies a set of patterns and structures
- Can be used as a standalone technique to gain insight into the data distribution

Clustering in Mahout
The two core components of clustering are:
- An algorithm to group items into clusters: k-means clustering, hierarchical clustering, and many others
- A notion of similarity and dissimilarity, which specifies how close various data points are

Business Applications of Clustering
- Market research: segmentation; targeted advertisements; customer categorization
- Government management: crime hot-spot analysis; census data for economic and social analysis
- Inventory management: location-based product stocking

Basic Steps in Clustering
- Select features from the data sets
- Choose a clustering approach
- Select parameters for the clustering
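The steps above can be made concrete with a minimal k-means sketch in pure Python (one-dimensional data, naive initialisation; an illustration of the approach, not the Mahout implementation):

```python
def kmeans(points, k, iterations=10):
    """Toy k-means: the 'parameters' of the slide are k and the iteration count."""
    centroids = points[:k]  # naive initialisation from the first k points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups, around 1 and around 10.
print(kmeans([1.0, 1.2, 0.8, 10.0, 10.2, 9.8], k=2))  # [1.0, 10.0]
```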

Executing a Mahout Clustering Algorithm
bin/mahout kmeans -i input-file -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25

Association Rules: Introduction
- Association rule (AR) mining involves finding frequent patterns, associations, correlations, or causal structures among a set of data
- AR is used to find interesting relationships between variables in transaction-oriented data sets
- Association rules are of the form {X} => {Y}, annotated with [support, confidence, lift]
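The three measures named above can be computed directly; this sketch evaluates the rule {milk} => {bread} over the earlier example transactions (plain Python for illustration):

```python
transactions = [{"milk", "bread"}, {"bread", "butter"},
                {"milk", "bread", "butter"}, {"milk"}]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # Of the transactions containing X, how many also contain Y?
    return support(x | y) / support(x)

def lift(x, y):
    # Confidence relative to Y's baseline frequency (>1 means positive association).
    return confidence(x, y) / support(y)

print(support({"milk", "bread"}))       # 0.5
print(confidence({"milk"}, {"bread"}))  # ~0.667
print(lift({"milk"}, {"bread"}))        # ~0.889
```

Here lift is below 1: buying milk makes bread slightly less likely than its overall frequency suggests in this tiny data set.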

Business Applications of Association Rules
- Cross-selling and up-selling: market basket analysis
- Catalog design: keeping related items together
- Planning and monitoring: network design based on web logs; planning web design using web-usage log mining
- In general, any decision based on knowledge of previous transactions by sets of users

Introduction to Classification
- Classification separates data into various categories by assigning a class label to each unclassified observation
- An unknown observation is assigned to a class based on a model; classification is therefore categorized as supervised learning
- Classification is a two-step process: model training and creation, based on a training data set whose observations have already been classified; and model testing, applying the model to test data to place the observations in the appropriate categories

Classification Terminology
- Training examples: the complete set of inputs for model development and testing
- Training data: the subset of training examples used to build the model
- Model: the set of rules generated after training the classifier
- Test data: the remaining training examples, not used in the training data, used to study how effective the model is
- Predictor variables: the variables used to decide the output for the target variable
- Target variable: the variable the classifier estimates, based on the predictor variables

The Structure of Input for a Mahout Classifier
The general structure of a training row is: TV, PV1, PV2, PV3, ..., PVn
The general structure of a test row is: ??, PV1, PV2, PV3, ..., PVn
where TV is the target variable, PV1..PVn are the predictor variables, and ?? marks the target value to be estimated from the predictors. The training rows feed classifier model generation; the generated model then fills in the TV for each test row.
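This input layout can be illustrated with a toy model (a 1-nearest-neighbour classifier chosen purely for brevity; it is not what Mahout's trainers use). Training rows carry a known TV, test rows carry "??" to be filled in:

```python
# (TV, [PV1, PV2]) rows, as in the slide's layout; labels are made up.
train = [("spam", [1.0, 0.9]), ("spam", [0.9, 1.0]), ("ham", [0.1, 0.2])]
test = [("??", [0.95, 0.95]), ("??", [0.0, 0.1])]

def classify(pvs):
    # Predict the TV of the nearest training row (squared Euclidean distance).
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[1], pvs))
    return min(train, key=dist)[0]

predictions = [classify(pvs) for _, pvs in test]
print(predictions)  # ['spam', 'ham']
```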

Business Applications of Classification
- Loan application processing
- Categorization of customer types
- Identification of product classes
- Classification of a privacy policy as safe or unsafe
- Medical diagnosis
- Targeted marketing

Classification Modeling Techniques
- Decision tree
- Random forest
- Many others

Decision Tree Modeling
- Produces a tree structure in which internal nodes represent a test on an attribute, branches represent the outcomes of the test, and leaf nodes represent classes
- Starts with the root node and splits into multiple branches, which may split further into new nodes, ending in leaf nodes that contain the decisions
- The model can take the form of a set of rules or of a decision tree; the rules specify the decision-making process of the model
- Uses a recursive partitioning (divide-and-conquer) approach; the process stops when partitioning no longer improves the outcome

Classification Algorithms Supported by Mahout
- Stochastic gradient descent (SGD): runs in sequential, online, and incremental mode; suitable for data sets of up to tens of millions of examples
- Naive Bayes: runs in parallel mode; suitable for data sets of millions to hundreds of millions of examples
- Random forest: runs in parallel mode; suitable for data sets of up to tens of millions of examples

Command for Training the SGD Classifier
Training: bin/mahout trainlogistic [options]
Major options:
--input <file>                 Input file
--output <file>                Output file
--target <variable>            Target variable
--categories <n>               Number of possible values of the target variable
--predictors <pv1> ... <pvn>   Predictor variables
--types <tpv1>,...,<tpvn>      Types of the predictor variables
--passes                       Number of times the input should be used
--rate                         Initial learning rate
--quiet                        Generate less detail about the execution
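Conceptually, what trainlogistic fits is a logistic regression trained by stochastic gradient descent. A minimal pure-Python sketch of that idea (with a learning rate and a number of passes playing the roles of --rate and --passes; this is an illustration, not Mahout's implementation):

```python
import math

def train_sgd(rows, passes=50, rate=0.5):
    """Fit logistic-regression weights by SGD over (target, [predictors]) rows."""
    weights = [0.0] * (len(rows[0][1]) + 1)  # bias + one weight per predictor
    for _ in range(passes):                  # --passes: sweeps over the input
        for target, pvs in rows:
            x = [1.0] + pvs                  # prepend the bias term
            z = sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of class 1
            for i in range(len(weights)):    # gradient step for this example
                weights[i] += rate * (target - p) * x[i]
    return weights

rows = [(1, [2.0]), (1, [1.5]), (0, [-1.0]), (0, [-2.0])]
w = train_sgd(rows)
prob = 1.0 / (1.0 + math.exp(-(w[0] + w[1] * 2.0)))
print(prob > 0.9)  # True: a large positive predictor scores as class 1
```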

Command for Evaluating the SGD Classifier
Evaluating: bin/mahout runlogistic [options]
Options:
--auc             Show the AUC score
--scores          Show the target variable and score for each input
--confusion       Display the confusion matrix
--input <file>    Input file
--model <model>   Read the model from the specified file
--quiet           Generate less detail about the execution

Applications of the Recommendation Framework
- Medicine: disease recommendation; drug recommendation; case-based search
- Marketing: cell phone companies identifying users who may switch; recommending books at Amazon; recommending products on web sites
- Education: universities guiding students on which courses to take; conference organizers assigning papers to reviewers

Road Map
- Introduction
- Hadoop Ecosystem
- Data Analytics Approaches
- Processing of Structured Data
- Processing of Unstructured Data
- Example Healthcare Applications
- Concluding Observations

Scope of Text Mining
Text mining sits at the intersection of data mining, statistics, natural language processing, web mining, information retrieval, and computational linguistics.

Challenges in Text Mining
- Individual documents may contain large amounts of text
- High dimensionality
- Ambiguity of content due to language features
- Semantic issues: words and phrases may not be semantically independent
- Complexity of natural language processing

General Steps in Text Mining
- Feature selection: determine the n-grams needed
- Text preprocessing: removal of numbers; removal of punctuation marks; text case conversion as needed; stop-word removal (using a pre-specified or generic list); stemming (identifying each word by its root)

Stop Words
The most common words in English do not contribute to classification, clustering, or association:
- Articles: a, an, the
- Conjunctions: and, or
- Prepositions: as, by, of
- Pronouns: you, she, he, it
Text documents are high-dimensional data, and removing stop words acts as a dimensionality reduction technique. Other non-context-related words can also be removed.
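Stop-word removal is a one-line filter; this sketch uses exactly the word classes listed above (the stop list is deliberately tiny, for illustration; real systems ship much longer lists):

```python
# The slide's example stop words: articles, conjunctions, prepositions, pronouns.
STOP_WORDS = {"a", "an", "the", "and", "or", "as", "by", "of",
              "you", "she", "he", "it"}

def remove_stop_words(text):
    # Lowercase, split on whitespace, and drop every stop word.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The patient and the doctor review results"))
# ['patient', 'doctor', 'review', 'results']
```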

Stemming
- The process of reducing inflected (or sometimes derived) words to their stem, base, or root form
- Typically achieved by removing suffixes such as "ing", "s", "er", "ed"; for example, mining, miner, mines, and mined all stem to "mine"
- Common algorithms: Porter's algorithm, the KSTEM algorithm, Snowball stemming
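A crude suffix-stripping stemmer conveys the idea (note that a real algorithm like Porter's would restore the "e" and map the slide's examples to "mine"; this toy version simply truncates to "min"):

```python
# Suffixes checked longest-first; require at least 3 letters to remain.
SUFFIXES = ("ing", "ed", "er", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["mining", "miner", "mines", "mined"]])
# ['min', 'min', 'min', 'min'] -- all four collapse to one stem
```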

Steps in Association Mining
- Load the data
- Preprocess the text as needed: cleaning; punctuation removal; number removal; stop-word removal; stemming
- Build the term-document matrix
- Find frequent term associations

Sample Twitter Data Set

Building a Term-Document Matrix
OUTPUT: A term-document matrix (16 terms, 20 documents)
Non-/sparse entries: 11/309
Sparsity: 97%
Maximal term length: 21
Weighting: term frequency (tf)
[Matrix excerpt: rows are terms (data, databases, dataframe, datasets, datastream, datatable, details, detection, development, ...), columns are documents 1-20; e.g. the term "data" occurs in documents 1, 2, 5 (twice), 11, 12 (twice), 13, 14, 15, and 17, while most other cells are zero.]
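Building such a matrix from scratch is straightforward; this sketch uses term-frequency (tf) weighting over three made-up documents, mirroring the R/tm output shown above:

```python
from collections import Counter

# Toy corpus (the real slide used 20 tweets).
docs = ["big data mining", "data streams", "mining tools"]

# Vocabulary: every distinct term, sorted for stable row order.
terms = sorted({w for d in docs for w in d.split()})

# Term-document matrix: one row per term, one column per document,
# each cell holding the term's frequency in that document.
matrix = [[Counter(d.split())[t] for d in docs] for t in terms]

for t, row in zip(terms, matrix):
    print(t, row)
# big [1, 0, 0]
# data [1, 1, 0]
# mining [1, 0, 1]
# streams [0, 1, 0]
# tools [0, 0, 1]
```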

Word Frequency in Tweets [Figure]

Displaying the Aggregate Output [Figure]

Road Map
- Introduction
- Hadoop Ecosystem
- Data Analytics Approaches
- Processing of Structured Data
- Processing of Unstructured Data
- Example Healthcare Applications
- Concluding Observations

Healthcare Expenses
[Figure] Source: Hersh, W., Jacko, J. A., Greenes, R., Tan, J., Janies, D., Embi, P. J., & Payne, P. R. (2011). Health-care hit or miss? Nature, 470(7334), 327.

Healthcare Data Types
- EHR
- Public health
- Social/behavioral

Healthcare Data
- Billing data: International Classification of Diseases (ICD); Current Procedural Terminology (CPT)
- Lab results: Logical Observation Identifiers Names and Codes (LOINC)
- Medication: National Drug Code (NDC) from the Food and Drug Administration (FDA)

Healthcare Data (cont'd)
- Clinical notes: unstructured text data
- Image data: unstructured data
- Social interaction data: unstructured data

[Diagram: analytic platform. Structured and unstructured EHR data feed feature selection over patients/context, supporting clustering, classification, and recommendation.]

Impact of Data-Driven Features
Source: Jimeng Sun, Jianying Hu, Dijun Luo, Marianthi Markatou, Fei Wang, Shahram Ebadollahi, Steven E. Steinhubl, Zahra Daar, Walter F. Stewart. Combining Knowledge and Data Driven Insights for Identifying Risk Factors using Electronic Health Records. AMIA 2012.

Applications of Patient Similarity
- Heart failure prediction
- Likelihood of blood pressure onset
- Disease recommendation
- Medicine recommendation

Information Analysis Options
- Image-based retrieval
- Case-based retrieval

Image-Based and Case-Based Retrieval
- Image-based retrieval: given a query image, find the most similar images
- Case-based retrieval: given a case description with details of the symptoms and tests, including images, find similar cases, including images with case descriptions
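The text side of case-based retrieval can be sketched with cosine similarity over term-count vectors (the case descriptions and names here are invented for illustration; real systems would also index images and structured fields):

```python
import math

# Hypothetical stored case descriptions.
cases = {
    "case-1": "chest pain shortness of breath",
    "case-2": "fracture left arm swelling",
}

def vectorize(text, vocab):
    # Term-count vector over a shared vocabulary.
    return [text.split().count(t) for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(query):
    # Build the vocabulary from the stored cases plus the query, then
    # return the stored case whose vector is closest to the query's.
    vocab = sorted({w for t in list(cases.values()) + [query] for w in t.split()})
    qv = vectorize(query, vocab)
    return max(cases, key=lambda c: cosine(vectorize(cases[c], vocab), qv))

print(most_similar("sudden chest pain"))  # case-1
```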

Road Map
- Introduction
- Hadoop Ecosystem
- Data Analytics Approaches
- Processing of Structured Data
- Processing of Unstructured Data
- Example Healthcare Applications
- Concluding Observations

Benefits of Health Care Analytics
- Better diagnosis
- Better health care delivery
- Better value for patient, provider, and payer
- Better innovation
- Better living

Big Data Analysis Challenge
Big data analysis requires a combination of machine learning background, domain expertise, and distributed programming skills.

Big Data Analytics: Barriers
- Cost of analytics
- Lack of skilled talent
- Difficulty of architecting big data solutions
- Big data scalability issues
- Limited analytics capability of existing databases
- Need for tangible business justification
- Lack of understanding of big data benefits

Thank you! Questions?