Big Data Analytics Opportunities and Challenges Anup Kumar, Professor and Director of MINDS Lab Computer Engineering and Computer Science Department University of Louisville
Road Map Introduction Hadoop Ecosystem Data Analytics Approaches Processing of Structured data Processing of Unstructured data Example Healthcare Applications Concluding Observations 2
Big Data Applications Advertising and marketing Customer shopping patterns Response to promotional campaign Manufacturing Maintenance of machine health Social Media Browsing and sentiment analysis Impact on buying patterns Government data Efficient process management 3
Big Data Applications (cont'd) Stock Market Stock performance prediction Healthcare Management Patient health monitoring Impact of preventive care Financial Institutions Fraud detection and mitigation Weather Prediction Impact analysis and better disaster management 4
Big Data Analytics: Benefits Saving money Fraud detection in medical industry Risk Management Lower cost and better outcomes Real time decision making Sensor data analysis Influence of social sentiment on patient health Getting to know your customer better Targeted advertisement Better recommendations 5
Components of Big Data Volume Data generated by 2020 will be in zettabytes (10^21 bytes) Soon after that the measures will be yottabytes (10^24) and brontobytes (10^27) Variety Structured (transactional data) Unstructured (image, video, and text data) Velocity Rate of data generation Increase in the ability to process data Variability/Veracity Trustworthiness of data Quality of data 6
Key Components of Big Data Analytics Understanding Advanced Analytics Association analysis Clustering, Classification, Regression Recommendation framework Time series analysis Handling volume and velocity of data Map/Reduce framework Cloud computing framework Custom frameworks Dealing with Variety of Data Structured Data Analytics (Advanced Analytics) Unstructured Data Analytics (Text Processing) 7
Data Analysis Options Descriptive analytics : Specifies the data characteristics How to describe the system? What happened in the system and when? What are the parameters in the system? What is the impact of a parameter on the system? Is there any correlation between the parameters? Predictive analytics: uses data mining and predictive modeling It can answer the following questions What are the future trends? What is the decision based on past history? Perform what if analysis 8
Handling Large Volume Hadoop Based Implementation of Analytics In-Database Analytics 9
Hadoop Based Analytics Allows analytics to be carried out on any type of data store Provides a standard framework for computation On a cloud environment On an on-premises network Advantages Vendor independent Allows any type of analytics to be carried out Disadvantages Complex implementation 10
In-Database Analytics Allows analytic computation to be carried out in the database Uses: SQL SQL extensions Advantages Higher analytics efficiency, easy usability, better database manageability Analytic centralization may allow easy security, data, and version management Disadvantages Vendor dependent Cost 11
Handling of Variety of Data Structured Data Analytics Association analysis Clustering Classification Regression Recommendation framework Time series analysis Unstructured Data Analytics Association analysis Clustering Classification Text processing 12
Road Map Introduction Hadoop Ecosystem Data Analytics Approaches Processing of Structured data Processing of Unstructured data Example Healthcare Applications Concluding Observations 13
Introduction to Hadoop Open source from Apache Free download from http://hadoop.apache.org Runs on commodity hardware Lowers procurement, licensing, and operational costs Scalable Incrementally add hardware nodes as needed Can accommodate growth in data and processing requirements Fault tolerant Automatic data replication and task reallocation 14
Core Components of Hadoop Big Data Architecture Integration Data Processing (MapReduce) Security Operations Data Storage (HDFS) 15
Hadoop Distributed File System (HDFS) Designed for large scale data processing Allows storage of large data files Runs on top of the native file system on each node Data is stored in units known as blocks 64 MB by default, although larger block sizes are common in practice Large files are stored in many blocks over many nodes Blocks are replicated to multiple nodes 16
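The block arithmetic above is easy to sketch. The following is a pure-Python illustration, not part of any Hadoop API; the function name `hdfs_block_plan` is invented for this example:

```python
import math

def hdfs_block_plan(file_size_mb, block_size_mb=64, replication=3):
    # Number of blocks the file occupies (the last block may be partial)
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    # Total block replicas spread across the cluster's DataNodes
    n_replicas = n_blocks * replication
    # Raw bytes stored: HDFS does not pad partial blocks to full size
    storage_mb = file_size_mb * replication
    return n_blocks, n_replicas, storage_mb

# A 200 MB file with the 64 MB default block size and 3-way replication:
hdfs_block_plan(200)  # -> (4, 12, 600)
```

This shows why commodity clusters need roughly three times the raw capacity of the data they hold.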
HDFS Blocks and Replication [Diagram: a file is split into Blocks 1, 2, and 3, and each block is replicated across Nodes 1-3] 17
Hadoop Node Types in a Cluster Nodes in Hadoop have different roles to play HDFS nodes NameNodes DataNodes MapReduce nodes JobTracker TaskTracker 18
Hadoop Node Interaction [Diagram: starting the Hadoop cluster launches the MapReduce layer, where a JobTracker coordinates TaskTrackers, and the HDFS layer, where a NameNode coordinates DataNodes] 19
Role of JobTracker Receives jobs from clients and manages jobs Master in a master-slave architecture with TaskTrackers Determines the job execution plan Assigns nodes to different processing tasks Monitors tasks as they execute Detects failures and restarts/re-runs tasks 20
Role of TaskTracker Manages individual tasks that the JobTracker issues Can be map or reduce tasks A single TaskTracker can run many map or reduce tasks in parallel Keeps the JobTracker updated with task progress using a heartbeat signal Failure to send a heartbeat results in the JobTracker resubmitting the task To another TaskTracker node 21
Role of NameNode HDFS is also a master-slave architecture The NameNode is the HDFS master directing DataNode slave nodes DataNodes perform low level input/output This is where data is stored Keeps track of files How they are broken into blocks Which nodes blocks are stored on Only one per Hadoop cluster! A single point of failure, so it should not run on commodity hardware 22
Role of DataNode Performs the low level work of reading and writing HDFS blocks to the local file system Communicates with: The NameNode Provides information about which blocks it is storing Other DataNodes, to replicate data blocks Many per Hadoop cluster Normally run on inexpensive commodity hardware 23
Hadoop Execution Modes Hadoop can be configured to run in one of three modes Local (standalone) mode Pseudo distributed mode Fully distributed mode Key configuration files define which mode to run in core-site.xml hdfs-site.xml mapred-site.xml 24
Local (Standalone) Mode This is the default mode for Hadoop No assumptions are made about hardware Does not use HDFS Used for developing and debugging application logic of Hadoop programs 25
Pseudo Distributed Mode Hadoop runs in clustered mode Cluster size is one Useful for developing and debugging Hadoop applications Enables examination of: Memory and HDFS usage Configuration files set the port numbers for the NameNode and JobTracker 26
Fully Distributed Mode This is the mode for running Hadoop applications in production Configuration requires setup of Master nodes Host of NameNode and JobTracker components Slave nodes Machines running DataNode and TaskTracker components NameNode hosts a report on HDFS status on port number 50070 Enables checking of each DataNode in cluster JobTracker provides status report on MapReduce jobs on port 50030 Reports about ongoing jobs 27
Introduction to Pig Hadoop is powerful for processing large datasets Requires significant programming skills to develop applications Pig is a Hadoop extension that simplifies Hadoop programming It is easy to use It is highly scalable There are two major components of Pig A data processing language called Pig Latin A compiler that translates Pig Latin into Hadoop MapReduce programs 28
Introduction to Hive A data warehousing package built on top of Hadoop Originally developed at Facebook Target audience for Hive is data analysts comfortable with SQL Provides access to ad hoc queries and data analysis On large scale datasets Focuses on structured data Enables optimizations to be performed Provides SQL like language HiveQL Based on tables, rows, columns No need to know about Hadoop programming 29
Working With Hive Hive requires a metastore service Provides schemas to structure the data Helps with efficient querying and processing Implemented using tables in a relational database By default, Hive uses the embedded Apache Derby database Hive can be accessed from A command line interface A web interface known as the Hive Web Interface 30
Introduction to Apache Sentry Features Secure authorization Ability to control and enforce access to data Grant privileges on data for authenticated users Fine grained access control Control at the database, table, and view level As an example of fine grained control One group can see all data Another group can see only non sensitive data filtered by a view Sentry provides a unified security platform: Existing Hadoop Kerberos security for authentication Sentry access policy applied to all applications (e.g., Pig, Hive) 31
Introduction to Oozie Oozie is a Hadoop workflow engine Manages data processing activities Available from http://oozie.apache.org Simplifies the management of Recurring and data driven jobs Re-execution of failed jobs Comprises A workflow engine Can execute different types of jobs A coordinator engine 32
Oozie Workflows Oozie workflows are directed acyclic graphs (DAGs) Graph nodes are either Action nodes Perform a workflow task Control flow nodes Determine the logic between tasks There is always exactly one start node and at least one terminal node [Diagram: a start node fans out to action nodes, which lead to terminal nodes] 33
Pig, Hive, and Impala in Big Data Architecture [Diagram: Pig, Hive, Impala, and Mahout run on top of MapReduce and HDFS; Sqoop and Flume provide integration, Sentry provides security, and Oozie provides operations] 34
Road Map Introduction Hadoop Ecosystem Data Analytics Approaches Processing of Structured data Processing of Unstructured data Example Healthcare Applications Concluding Observations 35
Hadoop Based Analytics Allows analytics to be carried out on any type of data store Provides a standard framework for computation On a cloud environment On an on-premises network Advantages Vendor independent Allows any type of analytics to be carried out Provides a flexible and adaptable architecture for implementation Disadvantages Complex implementation 36
MapReduce Features Allows use of large computational resources Inter cluster communication is managed by MapReduce MapReduce architecture supports Data and task distribution Fault monitoring Task and data replication Simple programming model Limitations of MapReduce Cannot solve all the problems 37
MapReduce: A Pragmatic Approach It can solve many Big Data problems Data filtering Statistics and aggregation Graph analytics Decision Tree and classification Clustering and recommendations Practical Distributed API Easier to understand and use Higher level APIs exist To reduce the complexity of programming Ability to schedule multi stage jobs 38
Phases in MapReduce Processing Data processing with MapReduce goes through three phases Map Phase Processes the data and generates <key, value> pairs Shuffle Phase Moves the <key, value> pairs to the appropriate processing node for reduction Reduce Phase Processes the <key, value> pairs to generate the final output Hadoop can use multiple machines for each phase 39
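The three phases can be sketched in a few lines of pure Python. This is an illustrative word-count simulation, not Hadoop code; the function names are invented for the example:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a <key, value> pair for every word in the input split
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into the final output
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "big analytics"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle_phase(pairs))
# counts == {"big": 2, "data": 1, "analytics": 1}
```

In a real cluster each phase runs on many machines; here the same dataflow runs in one process.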
Association Rule Example In order to compute support The number of times each product and its combinations occur in the data has to be calculated The original transaction file format 1, milk (M), bread (B) 2, bread (B), butter (T) 3, milk (M), bread (B), butter (T) 4, milk (M) The input data file (all item combinations per transaction) would be: M B MB B T BT M B T MB MT BT MBT M 40
Association Rule MapReduce Input: the per-transaction item combinations M B MB, B T BT, M B T MB MT BT MBT, and M Splitting: each transaction's combinations are sent to a mapper Mapping: each combination is emitted as a <key, 1> pair (e.g., M,1 B,1 MB,1) Shuffling: pairs with the same key are grouped together (e.g., M,1 M,1 M,1 and MB,1 MB,1) Reducing: each group's counts are summed to give the final support counts M,3 B,3 T,2 MB,2 BT,2 MT,1 MBT,1 41
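The same pipeline can be simulated in pure Python: the map step enumerates every item combination per transaction, and a grouped sum plays the roles of shuffle and reduce. This is an illustration of the slide's dataflow, not Mahout or Hadoop code:

```python
from collections import Counter
from itertools import combinations

def map_transaction(items):
    # Map: emit <itemset, 1> for every non-empty combination in the transaction
    pairs = []
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            pairs.append(("".join(combo), 1))
    return pairs

# The four transactions from the example (M = milk, B = bread, T = butter)
transactions = [["M", "B"], ["B", "T"], ["M", "B", "T"], ["M"]]

# Shuffle + reduce collapse into a single grouped sum in one process
support_counts = Counter()
for t in transactions:
    for itemset, one in map_transaction(t):
        support_counts[itemset] += one
# support_counts: M=3, B=3, T=2, MB=2, BT=2, MT=1, MBT=1
```

Note that MT occurs only in transaction 3, so its support count is 1.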
Road Map Introduction Hadoop Ecosystem Data Analytics Approaches Processing of Structured data Processing of Unstructured data Example Healthcare Applications Concluding Observations 42
Steps in Structured Data Analytics Step 1: Business domain analysis Step 2: Data exploration and investigation Step 3: Data preparation and cleaning Step 4: Model design and development Step 5: Model verification and testing Step 6: Analyze the output 43
Descriptive and Predictive Analytics Descriptive analytics provides knowledge discovered from the data set Examples include Clustering Correlation analysis Pattern discovery Association rules Predictive analytics builds a model to predict future behavior Examples include Regression models Time series analysis 44
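A regression model, the first predictive example above, can be sketched minimally. This is a pure-Python ordinary-least-squares fit for illustration only, with invented names and data:

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b: the simplest predictive model
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Past observations lying exactly on y = 2x + 1
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
# The fitted model then predicts a future trend:
prediction_at_5 = a * 5 + b  # 11.0
```

Descriptive analytics would instead summarize the observed data; the point of a predictive model is the final line, extrapolating beyond it.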
Introduction to Mahout Open source machine learning library Leverages the power of Hadoop MapReduce Currently available algorithms include Various clustering algorithms Singular value decomposition Parallel frequent pattern mining Naive Bayes classifier Random forest decision tree based classifier Collaborative filtering User and item based recommenders Others 45
Mahout Architecture [Diagram: applications call the Mahout machine learning library (clustering, classification, recommendation, filtering, etc.), which runs on Hadoop/MapReduce over a cluster of compute and storage nodes] 46
Mahout Machine Learning Library Interaction Command line interaction Allows execution of machine learning algorithms using Mahout commands Pros: Easy use of algorithms Cons: Have to specify a large number of options; only a fixed set of options is available Application Program Interface based interaction Allows execution of machine learning algorithms in the source code Pros: More flexibility is available in using the algorithms Cons: Requires programming knowledge 47
Clustering: Introduction Clustering is the grouping of available data based on similarity according to some pre-specified criteria Uses unsupervised learning techniques Uses statistical methods A way of looking for patterns or structure in the data that are of interest Allows partitioning of data into different groups Identifies a set of patterns and structures Can be used as a standalone technique to gain insight into the data distribution 48
Clustering in Mahout Two core components of clustering are An algorithm to group items into clusters K-means clustering Hierarchical clustering Many others A concept of similarity and dissimilarity This specifies how close various data points are 49
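The two components, a grouping algorithm and a distance measure, fit in a short sketch. This is a minimal pure-Python k-means using squared Euclidean distance, not Mahout's distributed implementation; all names are invented for the example:

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    # Initialize centroids by sampling k distinct points
    random.seed(seed)
    centroids = list(random.sample(points, k))
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins its nearest centroid
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(vals) / len(members) for vals in zip(*members))
    return centroids, clusters

# Two well-separated groups are recovered as two clusters
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, 2)
```

Mahout's `kmeans` job does the same two steps, but distributes the assignment step as a map phase and the update step as a reduce phase, and lets you swap in other distance measures such as cosine distance.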
Business Applications for Clustering Market research Segmentation Targeted advertisements Customer categorization Government management Crime hot spot analysis Census data for economic and social analysis Inventory management Location based product stocking 50
Basic Steps in Clustering Select features from the data sets Choose clustering approach Select parameters for clustering 51
Executing Mahout Clustering Algorithm /bin/mahout kmeans -i input-file -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25 52
Association Rules: Introduction Association rule (AR) mining involves finding one of the following from among a set of data Frequent patterns Associations Correlations Causal structures AR is used to find interesting relationships between variables in transaction-oriented data sets Association rules are of the form {X} -> {Y} with [support, confidence, lift] 53
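The three rule metrics are simple ratios over the transaction set. A pure-Python sketch (illustrative names, reusing the milk/bread/butter transactions from earlier):

```python
def rule_metrics(transactions, X, Y):
    # Support, confidence, and lift for the association rule {X} -> {Y}
    n = len(transactions)
    count = lambda items: sum(1 for t in transactions if items <= t)
    support = count(X | Y) / n            # fraction of transactions with X and Y
    confidence = count(X | Y) / count(X)  # P(Y | X)
    lift = confidence / (count(Y) / n)    # confidence relative to Y's base rate
    return support, confidence, lift

baskets = [{"milk", "bread"}, {"bread", "butter"},
           {"milk", "bread", "butter"}, {"milk"}]
s, c, l = rule_metrics(baskets, {"milk"}, {"bread"})
# s = 0.5, c = 2/3, l = 8/9 (lift < 1: buying milk slightly disfavours bread here)
```

A rule is "interesting" when its support and confidence clear chosen thresholds and its lift departs from 1.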
Business Applications of Association Rules Cross selling and up selling Market basket analysis Catalog design Keeping the related items together Planning and monitoring Network design based on web logs Planning web design using web usage log mining In general, any decision based on knowledge of previous transactions by sets of users 54
Introduction to Classification Classification separates data into various categories It comprises assigning a class label to unclassified data The classification of an unknown observation to a class is based on the model Thus, classification is categorized as supervised learning Classification is a two step process Model training and creation Based on a training data set where the observations have already been classified Model testing Apply the model to test data to classify them into appropriate categories 55
Classification Terminology Training examples Complete set of inputs for model development and testing Training data Subset of training examples used for building the model Model Set of rules generated after training the classifier Test data Remaining examples from training examples not used in training data to study how effective the model was Predictor variables Set of variables that are used to decide the output for target variable Target variable This is the variable the classifier is estimating, based on predictor variables 56
The Structure of Input for Mahout Classifier The general structure of the input for training is TV, PV1, PV2, PV3, ..., PVn The general structure of the test input is ??, PV1, PV2, PV3, ..., PVn where ?? refers to the target variable that needs to be evaluated based on the predictor variables (PV = predictor variable, TV = target variable) [Diagram: training rows feed classifier model generation; the generated model then assigns a TV to each test row] 57
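To make the row format concrete, here is a toy pure-Python classifier that parses "TV, PV1, ..., PVn" training rows and fills in the "??" of a test row using the nearest training example. This is not one of Mahout's algorithms; it only illustrates the input structure:

```python
def parse_row(line):
    # A training row has the form "TV, PV1, PV2, ..., PVn"
    fields = [f.strip() for f in line.split(",")]
    return fields[0], [float(v) for v in fields[1:]]

def predict(training_lines, test_line):
    # A test row has the form "??, PV1, ..., PVn"; fill in ?? from the
    # nearest training example (squared Euclidean distance over the PVs)
    train = [parse_row(l) for l in training_lines]
    pvs = [float(v) for v in test_line.split(",")[1:]]
    tv, _ = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[1], pvs)))
    return tv

training = ["yes, 1.0, 1.0", "no, 5.0, 5.0"]
predict(training, "??, 1.2, 0.9")  # -> "yes"
```

Mahout's classifiers consume the same row shape but build a trained model (SGD, Naive Bayes, random forest) instead of comparing raw distances.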
Business Applications of Classification Loan application processing Categorization of customer types Identification of product classes Classification of a privacy policy as safe or unsafe Medical diagnosis Targeted marketing 58
Classification Modeling Techniques Decision tree Random forest Many others 59
Decision Tree Modeling This provides a tree structure with Internal nodes representing a test on a certain attribute The result of the test represented by the branches at the internal node A class represented by a leaf node Starts with the root node and splits into multiple branches May further split into new nodes Ends in leaf nodes that contain the decisions The model can be in the form of a set of rules or a decision tree Rules specify the decision making process of the model Uses a recursive partitioning approach (divide and conquer) The process stops when partitioning does not improve the outcome 60
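One round of that recursive partitioning is a decision stump: try every "attribute <= threshold" test and keep the split with the fewest misclassifications. A full tree applies this recursively to each partition. A pure-Python sketch with invented names:

```python
from collections import Counter

def majority(labels):
    # Most common class label among the rows in a partition
    return Counter(labels).most_common(1)[0][0]

def best_stump(rows):
    # rows: list of (feature_tuple, label). Try every split and keep the best.
    best, best_err = None, len(rows) + 1
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            if not left or not right:
                continue
            ll, rl = majority(left), majority(right)
            err = sum(y != ll for y in left) + sum(y != rl for y in right)
            if err < best_err:
                best, best_err = (f, t, ll, rl), err
    return best

def stump_predict(stump, x):
    f, t, ll, rl = stump
    return ll if x[f] <= t else rl

rows = [((1,), "A"), ((2,), "A"), ((8,), "B"), ((9,), "B")]
stump = best_stump(rows)  # splits on feature 0 at threshold 2
```

Real decision tree learners score splits with information gain or Gini impurity rather than raw error, but the divide-and-conquer structure is the same.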
Algorithms Supported by Mahout for Classification Stochastic Gradient Descent (SGD) Can run in sequential, online, and incremental mode A data set of less than tens of millions is suitable for processing Naive Bayes Can run in parallel mode A data set of millions to hundreds of millions is suitable for processing Random forest Can run in parallel mode A data set of less than tens of millions is suitable for processing 61
Command for Training the SGD Classifier Training: bin/mahout trainlogistic [options] Major options and explanations: --input <file> (input file) --output <file> (output file) --target <variable> (target variable) --categories <n> (number of possible values of the TV) --predictors <pv1> ... <pvn> (predictor variables) --types <tpv1>,...,<tpvn> (types of the PVs) --passes (number of times the input should be used) --rate (initial learning rate) --quiet (generate less execution detail) 62
Command for Evaluating the SGD Classifier Evaluating: bin/mahout runlogistic [options] Options and explanations: --auc (show the AUC score) --scores (show the TV and scores for each input) --confusion (display the confusion matrix) --input <file> (input file) --model <model> (read the model from the specified file) --quiet (generate less execution detail) 63
Applications for the Recommendation Framework Medicine Disease recommendation Drug recommendation Case based search Marketing Cell phone companies identifying users who may switch Recommending books at Amazon Recommending products on web sites Education Universities guiding students on what courses to take Conference organizers assigning papers to reviewers 64
Road Map Introduction Hadoop Ecosystem Data Analytics Approaches Processing of Structured data Processing of Unstructured data Example Healthcare Applications Concluding Observations 65
Scope of Text Mining [Diagram: text mining draws on data mining, statistics, natural language processing, web mining, information retrieval, and computational linguistics] 66
Challenges in Text Mining Each document may contain a large amount of text High dimensionality Ambiguity of content due to language features Semantic issues Words and phrases may not be semantically independent Complexity of natural language processing 67
General Steps in Text Mining Feature selection Determine the n-grams needed Text preprocessing Removal of numbers Removal of punctuation marks Text case conversion as needed Stop word removal (can use a pre specified list or a generic list) Stemming (identify a word by its root) 68
Stop Words The most common words in English that do not contribute to classification, clustering, or association are: Articles a, an, the Conjunctions and, or Prepositions as, by, of Pronouns you, she, he, it Text documents are high dimension data Removal of stop words acts as a dimensionality reduction technique Other non context related words can also be removed 69
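Stop-word removal is a one-line filter. A pure-Python sketch with a tiny illustrative stop-word list (real systems use much longer, curated lists):

```python
# Illustrative stop-word list built from the categories on this slide
STOP_WORDS = {"a", "an", "the", "and", "or", "as", "by", "of",
              "you", "she", "he", "it"}

def remove_stop_words(text):
    # Lower-case, split, and drop stop words, shrinking the feature space
    return [w for w in text.lower().split() if w not in STOP_WORDS]

remove_stop_words("The patient and the doctor reviewed a scan")
# -> ['patient', 'doctor', 'reviewed', 'scan']
```

Eight words in, four features out: the dimensionality reduction the slide describes.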
Stemming The process of reducing inflected (or sometimes derived) words to their stem, base, or root form Typically achieved by removing suffixes such as -ing, -s, -er, -ed For example: mining, miner, mines, mined Stemmed word: mine Common algorithms are Porter's algorithm KSTEM algorithm Snowball stemming 70
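A toy suffix-stripping stemmer shows the idea; note that naive stripping produces cruder stems than Porter or Snowball (e.g. "walking" becomes "walk", but "mining" would become "min" rather than "mine"):

```python
def naive_stem(word):
    # Toy suffix stripping; real systems use Porter, KSTEM, or Snowball,
    # which apply context-sensitive rules rather than blind truncation
    for suffix in ("ing", "ed", "s", "er"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return stem
    return word

# naive_stem("walking") -> "walk"; naive_stem("jumped") -> "jump"
# naive_stem("mines")   -> "mine"; short words like "sing" are left alone
```

The minimum-stem-length guard prevents degenerate outputs like stemming "sing" to "s".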
Steps in Association Mining Loading the data Text preprocessing (as needed) Cleaning Punctuation removal Number removal Stop word removal Stemming Building term document matrix Finding frequent term association 71
Sample Twitter Data Set
Building a Term Document Matrix OUTPUT: A term-document matrix (16 terms, 20 documents) Non-/sparse entries: 11/309 Sparsity : 97% Maximal term length: 21 Weighting : term frequency (tf) Docs Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 data 1 1 0 0 2 0 0 0 0 0 1 2 1 1 1 0 1 0 0 0 databases 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 dataframe 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 datasets 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 datasetshttptconxrbuh 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 datastream 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 datatable 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 details 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 detection 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 development 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 73
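A term-document matrix like the output above is straightforward to build: rows are terms, columns are documents, and each entry is a term frequency. A pure-Python sketch (the slide's own output comes from a text-mining toolkit, not this code):

```python
from collections import Counter

def term_document_matrix(docs):
    # Rows = terms, columns = documents, entries = term frequency (tf)
    terms = sorted({w for d in docs for w in d.split()})
    return terms, [[Counter(d.split())[t] for d in docs] for t in terms]

terms, matrix = term_document_matrix(["big data big", "data mining"])
# terms  = ['big', 'data', 'mining']
# matrix = [[2, 0], [1, 1], [0, 1]]
```

Real corpora yield very sparse matrices (97% sparse in the slide's example), so production tools store only the non-zero entries.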
Word Frequency in Tweets 74
Displaying the Aggregate Output 75
Road Map Introduction Hadoop Ecosystem Data Analytics Approaches Processing of Structured data Processing of Unstructured data Example Healthcare Applications Concluding Observations 76
Healthcare Expenses Hersh, W., Jacko, J. A., Greenes, R., Tan, J., Janies, D., Embi, P. J., & Payne, P. R. (2011). Health-care hit or miss? Nature, 470(7334), 327. 77
Healthcare Data Types EHR Public Health Social/Behavioral 78
Healthcare Data Billing data International Classification of Diseases (ICD) Current Procedural Terminology (CPT) Lab results Logical Observation Identifiers Names and Codes (LOINC) Medication National Drug Code (NDC) by the Food and Drug Administration (FDA) 79
Healthcare Data (Cont'd) Clinical notes Unstructured text data Image data Unstructured data Social interaction data Unstructured data 80
Analytic Platform [Diagram: structured and unstructured EHR data feed feature selection for the patients/context, which drives clustering, classification, and recommendation] 81
Impact of Data Driven Features Jimeng Sun, Jianying Hu, Dijun Luo, Marianthi Markatou, Fei Wang, Shahram Ebadollahi, Steven E. Steinhubl, Zahra Daar, Walter F. Stewart. Combining Knowledge and Data Driven Insights for Identifying Risk Factors using Electronic Health Records. AMIA 2012. 82
Applications of Patient Similarity Heart failure prediction Likelihood of high blood pressure onset Disease recommendation Medicine recommendation 83
Information Analysis Options Image based Retrieval Case based Retrieval 84
Image based Retrieval Given a query image, find the most similar images Case based Retrieval Given a case description with details of the symptoms and tests, including images Find similar cases, including images with case descriptions 85
Road Map Introduction Hadoop Ecosystem Data Analytics Approaches Processing of Structured data Processing of Unstructured data Example Healthcare Applications Concluding Observations 86
Benefits of Health Care Analytics Better diagnosis Better health care delivery Better value for patient, provider and payer Better innovation Better living 87
Big Data Analysis Challenge Analysts need a machine learning background, domain expertise, and distributed programming skills 88
Big Data Analytics: Barriers Cost of analytics Lack of skilled talent Difficulty of architecting Big Data solutions Big Data scalability issues Limited capability of existing database analytics Difficulty demonstrating tangible business justification Lack of understanding of Big Data benefits 89
Thank you! Questions? 90