Big Data Analytics and Healthcare




Big Data Analytics and Healthcare
Anup Kumar, Professor and Director of MINDS Lab, Computer Engineering and Computer Science Department, University of Louisville

Road Map
- Introduction
- Data Sources
  - Structured EHR data
  - Unstructured EHR data
- Data Analytics Approaches
  - Processing of structured data
  - Processing of unstructured data
- Example Applications
- Conclusions

Big Data Applications
- Advertising and marketing
  - Customer shopping patterns
  - Response to promotional campaigns
- Manufacturing
  - Maintenance of machine health
- Social media
  - Browsing and sentiment analysis
  - Impact on buying patterns
- Email
  - Communication and interaction patterns
  - Influencing the product perception
- Government data
  - Efficient process management

Big Data Applications (cont'd)
- Stock market
  - Stock performance prediction
- Healthcare management
  - Patient health monitoring
  - Impact of preventive care
- Financial institutions
  - Fraud detection and mitigation
- Weather prediction
  - Impact analysis and better disaster management

Healthcare Expenses
Hersh, W., Jacko, J. A., Greenes, R., Tan, J., Janies, D., Embi, P. J., & Payne, P. R. (2011). Health-care hit or miss? Nature, 470(7334), 327.

Why Big Data Analytics Now?
- Volume
  - Data generated by 2020 will be measured in zettabytes (10^21 bytes)
  - Soon after that the measure will be the yottabyte (10^24 bytes) and the brontobyte (10^27 bytes, an informal name)
- Variety
  - Structured (transactional data)
  - Unstructured (image, video and text data)
- Velocity
  - Rate of data generation
  - Increase in the ability to process data
- Variability/Veracity
  - Trustworthiness of data
  - Quality of data

Big Data Analytics: Benefits
- Getting to know your patient better
  - Targeted medicine
  - Accurate diagnosis
  - Predicting disease onset
- Saving money
  - Fraud detection in the medical industry
  - Risk management
  - Lower cost and better outcomes
- Real time decision making
  - Sensor data analysis
  - Influence of social sentiment on patient health

Data Scientist: A Challenging Combination of Backgrounds
- Math and statistics background
- Domain expertise
- Programming skills

Big Data Analytics
- Big Data Analytics = Big Data + Advanced Analytics
- Advanced analytics includes:
  - Association rules
  - Classification and decision trees
  - Text analytics
  - Clustering
  - Regression
  - Machine learning
  - Etc.
- Deployments of analytics
  - MapReduce / Hadoop based deployment
  - In-database deployment
- Data analytics is applicable to any size of data


Healthcare Data Types (diagram): EHR, Public Health, Social, Behavioral

Healthcare Data
- Billing data
  - International Classification of Diseases (ICD)
  - Current Procedural Terminology (CPT)
- Lab results
  - Logical Observation Identifiers Names and Codes (LOINC)
- Medication
  - National Drug Code (NDC) by the Food and Drug Administration (FDA)

Healthcare Data (cont'd)
- Clinical notes
  - Unstructured text data
- Image data
  - Unstructured data
- Social interaction data
  - Unstructured data

Heritage Health Prize

GE Head Health Challenge

GE Challenge I


Analytic Platform (diagram)
- Inputs: structured EHR, unstructured EHR
- Components: clustering of patients / context, feature selection, classification, recommendation

Types of Data Analysis
- Descriptive analytics (traditional business intelligence)
  - Specifies the data characteristics
  - Also known as unsupervised learning
  - How to describe the system?
  - What happened in the system and when?
  - What are the parameters in the system?
  - What is the impact of a parameter on the system?
  - Is there any correlation between the parameters?
- Predictive analytics
  - Uses data mining and predictive modeling
  - Also known as supervised learning
  - What are the future trends?
  - What is the decision based on past history?
  - Perform what-if analysis

Implementation Options
- In-database analytics
- Distributed analytics
  - Cluster and cloud computing based
  - Hadoop / MapReduce based

In-Database Analytics
- Allows analytic computation to be carried out in the database
- Uses SQL and SQL extensions
- Advantages
  - Computation is close to the data and does not require data movement
  - Centralized analytics may simplify security, data and version management
  - Client access to in-database analytics is easy
  - Higher analytics efficiency, easy usability, better database manageability
- Disadvantages
  - Vendor dependent
  - Limited data type support in databases
  - Cannot run location-dependent analytics or text analytics
  - Cost of analytics
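The idea of keeping computation close to the data can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for an analytic database; the table name, column names and values are hypothetical and only serve the illustration.

```python
import sqlite3

# Hypothetical table, columns and values, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lab_results (patient_id INTEGER, test_code TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO lab_results VALUES (?, ?, ?)",
    [
        (1, "code_a", 95.0),
        (1, "code_a", 110.0),
        (2, "code_a", 150.0),
        (2, "code_a", 170.0),
    ],
)

# The aggregation runs inside the database engine; only the per-patient
# summary rows are returned to the client, not the raw measurements.
rows = conn.execute(
    "SELECT patient_id, COUNT(*), AVG(value) "
    "FROM lab_results GROUP BY patient_id ORDER BY patient_id"
).fetchall()
conn.close()

print(rows)
```

A production analytic database would typically add vendor-specific SQL extensions on top of this pattern, which is where the vendor dependence listed above comes from.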

Distributed Analytics
- Motivation
  - Large data size
  - Complex computation logic
  - Real time processing requirement
  - Cheaper hardware
  - Larger and faster storage space
- Challenges
  - How to divide and distribute data?
  - How to divide the algorithm?
  - How to manage distributed resources?
- Limitations
  - Many problems are not suitable for distributed computing
  - Sequential algorithms, for example computing the Fibonacci sequence, where each term depends on the two before it

Cluster and Cloud Computing
- Benefits
  - Availability of large computing resources
  - Huge storage space availability
  - Easy distribution of data and computation logic
- Availability of a more flexible distributed programming paradigm
  - MapReduce
- Effective implementations of MapReduce
  - Hadoop
  - RHadoop

Hadoop Based Analytics
- Allows analytics to be carried out on any type of data store
- Provides a standard framework for computation
  - In a cloud environment
  - On an on-premises network
- Advantages
  - Vendor independent
  - Allows any type of analytics to be carried out
  - Provides a flexible and adaptable architecture for implementation
- Disadvantages
  - Complex implementation

MapReduce
- Allows use of large computational resources
- Inter-cluster communication is managed by MapReduce
- MapReduce architecture supports
  - Data and task distribution
  - Fault monitoring
  - Task and data replication
- Simple programming model
- Limitations of MapReduce
  - Cannot solve all problems

MapReduce: A Pragmatic Approach
- It can solve many Big Data problems
  - Data filtering
  - Statistics and aggregation
  - Graph analytics
  - Decision tree and classification
  - Clustering and recommendations
- Practical distributed API
  - Easier to understand and use
- Higher level APIs exist
  - To reduce the complexity of programming
  - Ability to schedule multi-stage jobs

Phases in Hadoop Processing
Data processing with Hadoop goes through three phases:
- Map phase: processes the data and generates <key, value> pairs
- Shuffle phase: moves the <key, value> pairs to the appropriate processing node for reduction
- Reduce phase: processes the <key, value> pairs to generate the final output
Hadoop can use multiple machines for each phase.
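The three phases can be sketched in a few lines of plain Python. This is a single-process illustration of the data flow, not Hadoop code; the function names, the placeholder diagnosis codes and the visit records are all made up for the example.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (key, value) pair per diagnosis code in each record."""
    pairs = []
    for record in records:
        for code in record:
            pairs.append((code, 1))
    return pairs


def shuffle_phase(pairs):
    """Shuffle: group values by key, as Hadoop does when routing to reducers."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Reduce: combine the values of each key into the final output."""
    return {key: sum(values) for key, values in grouped.items()}


# Made-up visit records; each inner list holds placeholder diagnosis codes.
visits = [["D1", "D2"], ["D2"], ["D1", "D2", "D3"]]
counts = reduce_phase(shuffle_phase(map_phase(visits)))
print(counts)
```

In a real cluster each phase would run on several machines at once; here the three functions only make the hand-off between phases explicit.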

Association Rule Example
To compute support, the number of times each product and each combination of products occurs in the data has to be counted.

The original transaction file format:
1, milk, bread
2, bread, butter
3, milk, bread, butter
4, milk

Transaction ID   Milk (M)   Bread (B)   Butter (T)
1                1          1           0
2                0          1           1
3                1          1           1
4                1          0           0

Association Rule MapReduce
Input splitting (one split per transaction):
  1: M B
  2: B T
  3: M B T
  4: M
Mapping (every itemset in a transaction is emitted with count 1):
  1: M,1 B,1 MB,1
  2: B,1 T,1 BT,1
  3: M,1 B,1 T,1 MB,1 MT,1 BT,1 MBT,1
  4: M,1
Shuffling (pairs are grouped by key):
  M: 1, 1, 1
  B: 1, 1, 1
  T: 1, 1
  MB: 1, 1
  BT: 1, 1
  MT: 1
  MBT: 1
Reducing (counts are summed per key):
  M,3 B,3 T,2 MB,2 BT,2 MT,1 MBT,1
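The itemset counting behind this example can be reproduced with a short sketch. It enumerates every non-empty subset of each transaction (the map step) and sums the emitted counts per itemset (the shuffle and reduce steps, collapsed into one counter here). The function names are illustrative, and enumerating all subsets is only practical for tiny baskets like these.

```python
from collections import Counter
from itertools import combinations

# The four transactions of the example: M = milk, B = bread, T = butter.
transactions = [
    ["M", "B"],
    ["B", "T"],
    ["M", "B", "T"],
    ["M"],
]


def map_itemsets(items):
    """Map step: emit (itemset, 1) for every non-empty subset of a transaction."""
    ordered = sorted(items, key="MBT".index)  # fixed M, B, T order so keys match
    for size in range(1, len(ordered) + 1):
        for subset in combinations(ordered, size):
            yield "".join(subset), 1


def count_itemsets(all_transactions):
    """Shuffle and reduce steps combined: sum the emitted counts per itemset."""
    totals = Counter()
    for items in all_transactions:
        for key, one in map_itemsets(items):
            totals[key] += one
    return totals


counts = count_itemsets(transactions)
# Support of an itemset = fraction of transactions that contain it.
support = {key: n / len(transactions) for key, n in counts.items()}
print(dict(counts))
print(support)
```

With the four transactions listed in the example, milk and butter occur together only in transaction 3, so the count for MT is 1 and its support is 0.25, while MB and BT each have support 0.5.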


Steps in Structured Data Analytics
- Step 1: Business domain analysis
- Step 2: Data exploration and investigation
- Step 3: Data preparation and cleaning
- Step 4: Model design and development
- Step 5: Model verification and testing
- Step 6: Analyze the output

Applications of the Recommendation Framework
- Medicine
  - Disease recommendation
  - Drug recommendation
  - Case-based search
- Marketing
  - Cell phone companies identifying users that may switch
  - Recommending books at Amazon
  - Recommending products on web sites
- Education
  - Universities guiding students on what courses to take
  - Conference organizers assigning papers to reviewers

Similarity Concept Used in Recommendation Framework
The recommender can use the following rule: Matrix[S] x Matrix[U] = Matrix[R]

Similarity Matrix S      User Preference Matrix U      Recommendation Matrix R
4 5 3 2 1                4                             26
5 4 6 5 3                0                             45
3 6 4 1 5          x     0                       =     17
2 5 1 4 3                5                             28
2 3 5 3 4                0                             23
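The rule Matrix[S] x Matrix[U] = Matrix[R] can be checked with a short sketch. It reads the first 25 numbers of the slide as a 5 x 5 similarity matrix (row by row), the next five as the user preference vector, and compares the product with the last five numbers. Interpreting rows as items and the largest score as the top recommendation is an assumption, and the function name is illustrative.

```python
# Similarity matrix S and user preference vector U as given on the slide.
S = [
    [4, 5, 3, 2, 1],
    [5, 4, 6, 5, 3],
    [3, 6, 4, 1, 5],
    [2, 5, 1, 4, 3],
    [2, 3, 5, 3, 4],
]
U = [4, 0, 0, 5, 0]


def recommend(similarity, preferences):
    """Matrix-vector product: one score per row of the similarity matrix."""
    return [
        sum(s * u for s, u in zip(row, preferences))
        for row in similarity
    ]


R = recommend(S, U)
# Assumption: a higher score means a stronger recommendation.
ranked = sorted(range(len(R)), key=lambda i: R[i], reverse=True)
print(R)       # scores per item
print(ranked)  # item indices, best first
```

The computed vector is [26, 45, 17, 28, 23], which matches the recommendation matrix on the slide; the second item (index 1) receives the highest score.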


Scope of Text Mining (diagram)
Text mining shown together with related fields: data mining, statistics, natural language processing, web mining, information retrieval, computational linguistics

Challenges in Text Mining
- Each document may contain a large amount of text
  - High dimensionality
  - Difficult to identify which part is important to a pattern
- Ambiguity of content due to language features
- Semantic issues
  - Words and phrases may not be semantically independent
  - There may be subtle and complex relations between concepts in text
- Complexity of natural language processing
- Processing large training sets

General Steps in Text Mining
- Text splitting
  - Split text into a bag of words using text tokenization
  - Disadvantage: often loses semantic meaning
- Text preprocessing
  - Removal of numbers
  - Removal of punctuation marks
  - Text case conversion as needed
- Feature selection
  - Determine the n-grams necessary
  - Stop word removal (can use a pre-specified list or a generic list)
  - Stemming (identify a word by its root)
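A minimal sketch of these steps, using only the Python standard library, might look as follows. The stop word list is a tiny illustrative one, the sample sentence is made up, and the function names are not taken from any particular text mining package; stemming is left out of this sketch.

```python
import re

# A small illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "as", "by", "of", "you", "she", "he", "it"}


def preprocess(text):
    """Lowercase, drop numbers and punctuation, tokenize, remove stop words."""
    text = text.lower()                   # case conversion
    text = re.sub(r"[0-9]+", " ", text)   # removal of numbers
    text = re.sub(r"[^\w\s]", " ", text)  # removal of punctuation marks
    tokens = text.split()                 # bag-of-words tokenization
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Build n-grams from consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = preprocess("The patient, aged 54, reports chest pain and shortness of breath.")
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```

The bigram "chest pain" survives as one feature, which is one way n-grams recover some of the meaning that a plain bag of words loses.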

General Steps in Text Mining (cont'd)
- Determining the weighting of individual words
  - Term Frequency-Inverse Document Frequency (TF-IDF)
- Creating a term-document matrix
  - Terms and frequency of each word in a document
  - Simple TD
  - TF-IDF
  - Latent Semantic Indexing matrix
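The weighting step can be sketched with one common TF-IDF variant: term frequency is the term count divided by the document length, and inverse document frequency is the natural logarithm of (number of documents / number of documents containing the term). Other variants exist; this choice, the toy documents and the function names are assumptions made for the illustration.

```python
import math
from collections import Counter

# Toy, already-tokenized documents (made up for the example).
docs = {
    "d1": ["chest", "pain", "chest"],
    "d2": ["pain", "fever"],
    "d3": ["fever", "cough", "fever"],
}


def term_document_matrix(documents):
    """Simple term-document matrix: raw term counts per document."""
    return {name: Counter(tokens) for name, tokens in documents.items()}


def tf_idf(documents):
    """TF-IDF weights using tf = count / doc length, idf = ln(N / df)."""
    counts = term_document_matrix(documents)
    n_docs = len(documents)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())  # each document counts once per term
    weights = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values())
        weights[name] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in term_counts.items()
        }
    return weights


weights = tf_idf(docs)
print(weights["d1"])
```

In document d1, "chest" is both frequent in the document and rare across the collection, so it receives a higher weight than "pain", which also appears in d2.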

Stop Words
- The most common words in English that do not contribute to classification, clustering or association are:
  - Articles: a, an, the
  - Conjunctions: and, or
  - Prepositions: as, by, of
  - Pronouns: you, she, he, it
- Text documents are high-dimensional data
- Removal of stop words acts as a technique for dimensionality reduction
- Other non-context-related words can also be removed

Stemming
- The process of reducing inflected (or sometimes derived) words to their stem, base or root form
- Typically achieved by removing suffixes such as -ing, -s, -er, -ed
- For example: mining, miner, mines, mined
  - Stemmed word: mine
- Common algorithms are
  - Porter's algorithm
  - KSTEM algorithm
  - Snowball stemming
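Suffix removal can be illustrated with a deliberately naive sketch. It is not Porter's algorithm, KSTEM or Snowball; it only strips the first matching suffix from a short hypothetical list, and the minimum stem length of three characters is an arbitrary assumption.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "er", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


stems = [naive_stem(w) for w in ["mining", "miner", "mines", "mined"]]
print(stems)
```

All four words collapse to the same stem, "min", rather than the dictionary form "mine"; established stemmers apply additional rules, for instance restoring a final "e" in some cases, which plain suffix stripping does not attempt.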

Steps in Association Mining
- Loading the data
- Text preprocessing (as needed)
  - Cleaning
  - Punctuation removal
  - Number removal
  - Stop word removal
  - Stemming
- Building the term-document matrix
- Finding frequent term associations


Impact of Data Driven Features
Jimeng Sun, Jianying Hu, Dijun Luo, Marianthi Markatou, Fei Wang, Shahram Ebadollahi, Steven E. Steinhubl, Zahra Daar, Walter F. Stewart. Combining Knowledge and Data Driven Insights for Identifying Risk Factors using Electronic Health Records. AMIA 2012.

Applications of Patient Similarity
- Heart failure prediction
- Likelihood of diabetic onset
- Disease recommendation
- Medicine recommendation

Medical Imaging

Analysis Outcomes
- Modality classification
- Image-based retrieval
- Case-based retrieval

Modality Classification

Image Query
- Image-based retrieval
  - Given a query image, find the most similar images
- Case-based retrieval
  - Given a case description (details of the symptoms and tests, including images), find similar cases, including images with case descriptions
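Image-based retrieval can be sketched as a nearest-neighbour search over feature vectors. The sketch below assumes each image has already been reduced to a numeric feature vector by some upstream step that is not shown, and it uses cosine similarity as the similarity measure; both are assumptions, and the image identifiers and vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, index, k=2):
    """Return the names of the k indexed images most similar to the query."""
    scored = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Hypothetical image identifiers with made-up three-dimensional feature vectors.
index = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}
top = retrieve([2.0, 0.0, 0.0], index)
print(top)
```

Case-based retrieval could follow the same pattern with a combined vector built from image features and text features of the case description, although how those are combined is a design decision the slide does not specify.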

Genetic Data
- The human genome is composed of DNA with four building blocks: A, T, C, G
- Contains about three billion base pairs
- Size of the human genome is about 3 GB (at one byte per base)

Genome Wide Association Studies (GWAS)
- Identify common genetic factors that influence health and disease
- Compare the DNA of patients with a disease to that of similar people without the disease
- Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide in the genome sequence differs between individuals
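The case versus control comparison can be illustrated for a single SNP with a Pearson chi-square test on a 2 x 2 table of allele counts. This is one common way to test association, chosen here as an assumption since the slide names no specific statistic; the counts are made up.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table: no association can be measured
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Made-up allele counts for one hypothetical SNP.
# Rows: cases, controls. Columns: variant allele, reference allele.
cases_variant, cases_reference = 60, 40
controls_variant, controls_reference = 40, 60

statistic = chi_square_2x2(
    cases_variant, cases_reference, controls_variant, controls_reference
)
print(statistic)
```

For these counts the statistic is 8.0, above the 3.84 critical value for one degree of freedom at the 0.05 level. A genome-wide study repeats such a test across a very large number of SNPs, so a much stricter threshold is needed to account for multiple testing.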

Epidemiology Data
- Source
  - Surveillance, Epidemiology, and End Results (SEER) Program at NIH
- Usage
  - Understanding disparities in diseases related to race, age and gender
  - Correlating information with other data sources such as pollution, climate and socioeconomic data
  - Predictive analysis can be used for various disease trends

Social Networks for Patients
- PatientsLikeMe (www.patientslikeme.com)
  - Has more than 200,000 patients and is tracking more than 1,600 diseases

Big Data Analytics: Barriers
- Cost of analytics
- Lack of skilled talent
- Difficult to architect a Big Data solution
- Big data scalability issues
- Limited capability of existing database analytics
- Tangible business justification
- Lack of understanding of Big Data benefits

Concluding Remarks
- Better diagnosis
- Better health care delivery
- Better value for patient, provider and payer
- Better innovation
- Better living