Internet of Things Data Analytics - Part 1 Introduction to Data Analytics Aveek Dutta Assistant Professor Electrical Engineering and Computer Science University of Kansas e-mail: aveekd@ku.edu http://www.ittc.ku.edu/~aveekd
Three Elements of IoT Origin of Data (Week 2 - Networks) Data Acquisition (Week 3 - Sensors / Android) Data Interpretation (Week 4 - Analytics)
Objectives Attributes of Data (e.g., shape, size, color) Lifecycle approach to data science and analytics From statistics to analytics Differences between Data analyst and BI analyst Apply techniques and tools to analyze Big Data Create statistical models Lead to actionable results Visualization techniques to clearly communicate insights Overview of MapReduce/Hadoop and in-database analytics (Time permitting)
What is YOUR definition of BIG DATA? How big is BIG? Where does it come from? Why is it hard to analyze? What is the value?
Why learn this? Source : Datascience@Berkeley A recent study by the McKinsey Global Institute concludes, "a shortage of the analytical and managerial talent necessary to make the most of Big Data is a significant and pressing challenge (for the U.S.)." The report estimates that there will be four to five million jobs in the U.S. requiring data analysis skills by 2018, and that large numbers of positions will only be filled through training or retraining. The authors also project a need for 1.5 million more managers and analysts with deep analytical and technical skills "who can ask the right questions and consume the results of analysis of big data effectively."
Key characteristics of Big Data
Data Volume: 44x increase from 2009 to 2020 (0.8 zettabytes to 35.2 zettabytes)
Processing Complexity: changing data structures; use cases warranting additional transformations and analytical techniques
Data Structure: greater variety of data structures to mine and analyze
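The volume figures on this slide imply a yearly growth rate, which a short calculation makes concrete. This is a minimal sketch that assumes growth compounds evenly across the 11 years from 2009 to 2020; the function name `annual_growth_rate` is illustrative and not from the slides.

```python
def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by a start value and an end value."""
    return (end / start) ** (1 / years) - 1


start_zb, end_zb = 0.8, 35.2      # zettabytes, figures quoted on the slide
years = 2020 - 2009               # 11 years (assumes even compounding)

growth_factor = end_zb / start_zb                 # total multiple over the period
cagr = annual_growth_rate(start_zb, end_zb, years)

print(f"Total growth: {growth_factor:.0f}x")
print(f"Implied annual growth: {cagr:.1%}")
```

Under that even-compounding assumption, a 44x increase over 11 years works out to roughly 41% growth per year.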
CISCO VNI
The full report is highly recommended. It provides the numbers for networked systems only.
VELOCITY, VARIETY (changes with time), VOLUME
Data Attributes - Temporal Uncertainty Post Analysis - learn from past events Extract models, fit data points Prediction and Forecast based on derived models, Improve models
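The loop described above (learn from past events, fit a model, forecast, then improve the model) can be sketched with an ordinary least-squares line. The slide does not name a model, so the straight-line fit is an assumption chosen for simplicity, and the observations are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(model, x):
    """Predict y at x from a (slope, intercept) model."""
    slope, intercept = model
    return slope * x + intercept


# Post analysis: hypothetical past observations (time step, reading)
past_t = [0, 1, 2, 3, 4]
past_y = [1.0, 3.1, 4.9, 7.2, 8.8]

# Extract a model from the past data, then forecast the next time step
model = fit_line(past_t, past_y)
predicted = forecast(model, 5)

# Improve the model: refit once the real observation for t=5 arrives
past_t.append(5)
past_y.append(11.0)
model = fit_line(past_t, past_y)
```

Refitting on every new observation is the simplest form of "improve models"; a real pipeline would also track the forecast error to decide when the model needs replacing.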
Data Attributes - Dimensions
Data Attributes - Relationship Public Domain, https://en.wikipedia.org/w/index.php?curid=32880988
Data Attributes - Correlation
Definitions, Drivers and Differences
5 Vs of Big Data
Figure: https://media.licdn.com/mpr/mpr/p/1/005/047/21c/08d279d.jpg
And VALUE
Industry Implications
Volume: Communications / Topology, Network Management
Velocity: Hybrid SoC (CPU-GPU-FPGA), Communications, HPC / SAN / Fault Tolerance
Variety: Applications, Classification
Veracity: Crowdsourcing / Trustworthy Recommendation systems
Data Structures: Increasingly Unstructured
(ordered from more structured to less structured)
Structured: Data containing a defined data type, format, structure. Example: Transaction data and OLAP.
Semi-Structured: Textual data files with a discernible pattern, enabling parsing. Example: XML data files that are self-describing and defined by an XML schema.
Quasi-Structured: Textual data with erratic data formats; can be formatted with effort, tools, and time. Example: Web clickstream data that may contain some inconsistencies in data values and formats.
Unstructured: Data that has no inherent structure and is usually stored as different types of files. Example: Text documents, PDFs, images and video.
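The four categories differ mainly in how much work it takes to get usable fields out of the data. The sketch below parses one small, hypothetical sample of each kind using only the Python standard library; all sample records, field names, and the example URL are invented for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qs

# Structured: fixed columns and types (hypothetical transaction records)
structured = "order_id,amount\n1001,19.99\n1002,5.00\n"
orders = [
    {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
    for row in csv.DictReader(io.StringIO(structured))
]

# Semi-structured: self-describing XML, parsed by following its tags
semi = "<reading sensor='t1'><value unit='C'>21.5</value></reading>"
root = ET.fromstring(semi)
value_node = root.find("value")
reading = {
    "sensor": root.get("sensor"),
    "value": float(value_node.text),
    "unit": value_node.get("unit"),
}

# Quasi-structured: a clickstream URL; fields exist but may be missing or erratic
click = "http://www.example.com/search?q=data+scientist&pq=big+data&hl=en"
params = parse_qs(urlsplit(click).query)
search_terms = params.get("q", [""])[0]          # tolerate a missing parameter

# Unstructured: free text with no inherent fields; only crude measures come for free
note = "Sensor on the roof stopped reporting after the storm last night."
word_total = len(note.split())
```

The amount of defensive code grows as structure decreases: the CSV rows convert directly, while the URL needs a default for each parameter and the free text yields nothing beyond a word count without further analysis.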
Four Main Types of Data Structures
Structured Data
Semi-Structured Data (example: View Source)
Quasi-Structured Data (example: http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651)
Unstructured Data (example: The Red Wheelbarrow, by William Carlos Williams)
Big Data Ecosystem
1 Data Devices
2 Data Collectors
3 Data Aggregators
4 Data Users/Buyers
Entities shown in the figure: Individual, Analytic Services, Medical, Information Brokers, Advertising, Marketers, Employers, Law Enforcement, Internet, Government, Websites, Catalog Co-Ops, Phone/TV, Media, Media Archives, Credit Bureaus, Retail, Financial, Banks, List Brokers, Delivery Service, Private Investigators/Lawyers
Sources of Big Data - An Overview Source: Kapow Software, a Kofax company
Data Repositories
Data Islands ("Spreadmarts"): Isolated data marts. Spreadsheets and low-volume DBs for recordkeeping. Analyst dependent on data extracts.
Data Warehouses: Centralized data containers in a purpose-built space. Supports BI and reporting, but restricts robust analyses. Analyst dependent on IT & DBAs for data access and schema changes. Analysts must spend significant time to get extracts from multiple sources.
Analytic Sandbox: Data assets gathered from multiple sources and technologies for analysis. Enables high performance analytics using in-db processing. Reduces costs associated with data replication into "shadow" file systems. Analyst-owned rather than DBA-owned.
Business Intelligence vs. Data Science
Figure: BUSINESS VALUE (Low to High) plotted against TIME (Past to Future), with Business Intelligence and Data Science (Predictive Analytics & Data Mining) as the two regions.
Data Science
Typical Techniques & Data Types: Optimization, predictive modeling, forecasting, statistical analysis. Structured/unstructured data, many types of sources, very large data sets.
Common Questions: What if..? What's the optimal scenario for our business? What will happen next? What if these trends continue? Why is this happening?
Business Intelligence
Typical Techniques & Data Types: Standard and ad hoc reporting, dashboards, alerts, queries, details on demand. Structured data, traditional sources, manageable data sets.
Common Questions: What happened last quarter? How many did we sell? Where is the problem? In which situations?
Profile of a Data Scientist Quantitative Technical Skeptical Curious & Creative Communicative & Collaborative
Big Data Analytics: Industry Examples
Health Care: Reducing Cost of Care
Public Services: Preventing Pandemics
Life Sciences: Genomic Mapping
IT Infrastructure: Unstructured Data Analysis
Online Services: Social Media for Professionals
Module 1: Introduction to BDA
Big Data Analytics: Healthcare
Situation: Poor police response and problems with medical care, triggered by the shooting of a Rutgers student. The event drove a local doctor to map crime data and examine local health care.
Use of Big Data: Dr. Jeffrey Brenner generated his own crime maps from the medical billing records of 3 hospitals.
Key Outcomes: City hospitals and ERs provided expensive, low-quality care. Reduced hospital costs by 56% by realizing that 80% of the city's medical costs came from 13% of its residents, mainly low-income or elderly. Now offers preventative care over the phone or through home visits.
Big Data Analytics: Public Services
Situation: Threat of global pandemics has increased exponentially. Pandemics spread at faster rates and are more resistant to antibiotics.
Use of Big Data: Created a network of viral listening posts. Combines data from viral discovery in the field, research in disease hotspots, and social media trends. Using Big Data to make accurate predictions on the spread of new pandemics.
Key Outcomes: Identified a fifth form of human malaria, including its origin. Identified why efforts failed to control swine flu. Proposing more proactive approaches to preventing outbreaks.
Big Data Analytics: Life Sciences
Situation: Broad Institute (MIT & Harvard) mapping the Human Genome. In 13 yrs, mapped 3 billion genetic base pairs; 8 petabytes.
Use of Big Data: Developed 30+ software packages, now shared publicly, along with the genomic data.
Key Outcomes: Using genetic mappings to identify cellular mutations causing cancer and other serious diseases. Innovating how genomic research informs new pharmaceutical drugs.
Big Data Analytics: IT Infrastructure
Situation: Explosion of unstructured data required new technology to analyze quickly and efficiently. Doug Cutting created Hadoop to divide large processing tasks into smaller tasks across many computers.
Use of Big Data: Analyzes social media data generated by hundreds of thousands of users.
Key Outcomes: New York Times used Hadoop to transform its entire public archive, from 1851 to 1922, into 11 million PDF files in 24 hrs. Applications range from social media, sentiment analysis, wartime chatter, natural language processing.
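The idea of dividing a large task into smaller tasks can be illustrated with the classic word-count example in the map, shuffle, reduce style. This is a single-process sketch in plain Python, not the Hadoop API; the function names are illustrative, and in a real cluster each map and reduce call could run on a different machine.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key so each key goes to one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    mapped = []
    for doc in documents:            # each document is an independent map task
        mapped.extend(map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data is big", "data about data"])
```

Because each map task sees only its own document and each reduce task sees only its own key, neither step needs shared state, which is what lets the work spread across many computers.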
Big Data Analytics: Online Services
Situation: Opportunity to create a social media space for professionals. Collects and analyzes data from over 100 million users.
Use of Big Data: Adding 1 million new users per week.
Key Outcomes: LinkedIn Skills, InMaps, Job Recommendations, Recruiting. Established a diverse data scientist group, as the founder believes this is the start of the Big Data revolution.
Data Analytics Lifecycle
How to Approach Your Analytics Problems Your Thoughts? How do you currently approach your analytics problems? Do you follow a methodology or some kind of framework? How do you plan for an analytic project? Module 2: Data Analytics Lifecycle
Value of Using the Data Analytics Lifecycle Focus your time Ensure rigor and completeness Enable better transition to members of the cross-functional analytic teams Repeatable Scale to additional analysts Support validity of findings
Data Analytics Lifecycle
1 Discovery: Do I have enough information to draft an analytic plan and share for peer review?
2 Data Prep: Do I have enough good quality data to start building the model?
3 Model Planning: Do I have a good idea about the type of model to try? Can I refine the analytic plan?
4 Model Building: Is the model robust enough? Have we failed for sure?
5 Communicate Results
6 Operationalize
Phase 1: Discovery
Key question: Do I have enough information to draft an analytic plan and share for peer review?
Learn the Business (Problem) Domain: Determine the amount of domain knowledge needed to interpret results downstream. Determine the general analytic problem type (such as clustering, classification). If you don't know, then conduct initial research to learn about the domain area you'll be analyzing.
Learn from the past (aka literature review): Have there been previous attempts in the organization to solve this problem? If so, why did they fail? Why are we trying again? How have things changed?
Phase 1: Discovery
Key question: Do I have enough information to draft an analytic plan and share for peer review?
Resources: Assess available technology, available data, people (team), and time (man-hours).
Frame the problem: Framing is the process of stating the analytics problem to be solved. State the analytics problem, why it is important, and to whom. Clearly articulate the current situation and pain points.
Objectives: What is the goal? What are the criteria for success? What's good enough? What is the failure criterion (when do we just stop trying or settle for what we have)?
Phase 1: Discovery
Key question: Do I have enough information to draft an analytic plan and share for peer review?
Formulate Initial Hypotheses (IH, H1, H2, H3, ... Hn): Gather and assess hypotheses from stakeholders and domain experts. Preliminary data exploration to inform discussions with stakeholders during the hypothesis forming stage.
Identify Data Sources, Begin Learning the Data: Aggregate sources for previewing the data and provide high-level understanding. Review the raw data. Determine the structures and tools needed. Scope the kind of data needed for this kind of problem.
Phase 2: Data Preparation
Key question: Do I have enough good quality data to start building the model?
Prepare Analytic Sandbox: Bandwidth and network. Perform ELT (Extract - Load - Transform).
Data Conditioning: Clean and normalize data. Discern what you keep vs. what you discard.
Survey & Visualize: Overview, zoom & filter, details-on-demand. Descriptive statistics.
Useful Tools for this phase: For Data Transformation & Cleansing: SQL, Hadoop, MapReduce, Alpine Miner. For Visualization: R (base package, ggplot and lattice), GnuPlot, Ggobi/Rggobi, Spotfire, Tableau.
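Data conditioning and descriptive statistics can be sketched in a few lines. This example assumes the simplest possible rules: discard anything that does not parse as a number, and rescale with min-max normalization. The slide does not prescribe either rule, and the sample readings are hypothetical.

```python
import statistics


def clean(values):
    """Keep entries that parse as numbers; discard missing or malformed ones."""
    cleaned = []
    for v in values:
        try:
            cleaned.append(float(v))
        except (TypeError, ValueError):
            continue                 # e.g. None or "n/a" is discarded
    return cleaned


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                     # avoid division by zero on constant data
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def describe(values):
    """Basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


raw = ["21.5", "22.0", None, "n/a", "19.5", "23.0"]   # hypothetical sensor log
readings = clean(raw)
summary = describe(readings)
scaled = min_max_normalize(readings)
```

Comparing `summary["count"]` with `len(raw)` shows how much data the cleaning rule discarded, which is worth checking before deciding the rule is acceptable.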
Phase 3: Model Planning
Key question: Do I have a good idea about the type of model to try? Can I refine the analytic plan?
Determine Methods: Select methods based on hypotheses, data structure and volume. Ensure techniques and approach will meet business objectives.
Variable Selection: Inputs from stakeholders and domain experts. Capture the essence of the predictors; leverage a technique for dimensionality reduction. Iterative testing to confirm the most significant variables.
Useful Tools for this phase: R/PostgreSQL, SQL Analytics, Alpine Miner, SAS/ACCESS, SPSS/ODBC
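One simple way to start variable selection is to rank candidate predictors by the strength of their linear correlation with the target. The slide does not name a specific technique, so Pearson correlation screening is an assumption here, and the variable names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0                   # a constant variable carries no linear signal
    return cov / (spread_x * spread_y)


def rank_predictors(candidates, target):
    """Sort candidate variables by absolute correlation with the target."""
    scores = {name: pearson(values, target) for name, values in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical candidate predictors for an energy-use target
candidates = {
    "temperature": [10, 15, 20, 25, 30],
    "day_of_month": [3, 28, 11, 7, 19],
    "occupancy": [2, 2, 2, 2, 2],
}
energy_use = [50, 42, 35, 30, 22]

ranking = rank_predictors(candidates, energy_use)
```

Correlation screening only detects linear, one-variable-at-a-time relationships, so it suits a first pass; the iterative testing mentioned on the slide is still needed to confirm which variables matter.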
Phase 4: Model Building
Key question: Is the model robust enough? Have we failed for sure?
Develop data sets for testing, training, and production purposes. Need to ensure that the model data is sufficiently robust for the model and analytical techniques. Smaller test sets for validating approach, training set for initial experiments.
Get the best environment you can for building models and workflows: fast hardware, parallel processing.
Useful Tools for this phase: R, PL/R, SQL, Alpine Miner, SAS Enterprise Miner
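Separating training data from test data is the first practical step in this phase. The sketch below shows a shuffled split; the 20% test fraction and the fixed seed are illustrative defaults, not values from the slides.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and split them into (train, test) lists."""
    rng = random.Random(seed)        # fixed seed keeps the split reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data set of 10 record identifiers
records = list(range(10))
train, test = train_test_split(records, test_fraction=0.2, seed=1)
```

Shuffling before splitting avoids a test set made only of the most recent records; for time-ordered data that may be exactly what is wanted, in which case the shuffle should be dropped.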
Phase 5: Communicate Results
Did we succeed? Did we fail?
Interpret the results. Compare to the IHs from Phase 1. Identify key findings. Quantify business value. Summarize findings (depends on audience).
Phase 6: Operationalize
Run a pilot. Assess the benefits. Provide final deliverables. Implement the model in the production environment. Define a process to update, retrain, and retire the model, as needed.
We should be confident in answering...
What is Big Data? Why is it hard and challenging?
What are the attributes of data?
What are the 5 Vs of analytics?
Structure, source and management of data
What are the differences with BI?
What are the components of the Data Analytics Lifecycle? Discovery, Data Prep, Model Planning, Model Building, Communicate Results, Operationalize
More Reading - Free ebooks on Data Analysis