Sunnie Chung. Cleveland State University




Data Scientist: Big Data Processing, Data Mining

Data scientists sit at the intersection of computer scientists and statisticians, combining knowledge of data mining with big data processing skills:
- Handle big data: collect, process, and extract value from big data (giant and diverse data sets)
- Understand, visualize, and present their findings to non-data scientists
- Create data-driven solutions that boost profits, reduce costs, and even help save the world

Data scientists tackle big data projects on every level. Big data and cloud projects are on every CEO's to-do list:
- The Defense Department
- NASA: predicting earthquakes (especially after Nepal's earthquake)
- NSA, Homeland Security: predicting and preventing terrorist acts
- Internet start-ups
- Financial institutions

- Volume: an unprecedentedly huge volume of data, fueled by web-based business, social networking, and microblogs (e.g., click streams captured in web server logs). For example, eBay processes 8 petabytes of data per night.
- Various structures of data (or no structure): structured (database, data warehouse), semi-structured (web pages), and, most of the time, unstructured (web server logs, sensor data).
- Velocity: new data is generated at an unprecedentedly high rate, e.g., streaming Twitter messages. Machine-generated data streaming in from smart devices, sensors, monitors, and meters needs big data analytics.

Numerous new analytic and business intelligence opportunities, such as:
- Fraud detection
- Customer profiling
- Customer loyalty analysis
All of these directly affect business revenue and critical business decisions.

- Identify field-specific motives/purposes
- Identify the nature of the big data source and data-specific processes
- Decide on the IT infrastructure for big data processing systems
  - Public cloud or private cloud
  - Which MPP big data systems should be built for our specific big data source and volume
- Execute data analytics
  - Data source modeling
  - Apply data mining strategies
  - Research solutions
  - Implement big data processing steps for solutions/strategies
  - Analyze results and interpretation, with feedback

- Massively Parallel Processing (MPP) Parallel Data Warehouse (PDW) systems: Oracle, IBM, Teradata, Microsoft
- Hadoop systems with MapReduce: Google, Yahoo, Facebook, Twitter, LinkedIn
- Hybrid of both
- MPP systems on cloud: Amazon, Google, Microsoft, Oracle

- MPP system
- Virtual machine (VM)
- Cloud
  - Cloud types: Cloud as Service, Cloud as Platform
  - Amazon Elastic Cloud, Google Cloud, Microsoft Cloud (Azure)

- Anomaly detection: the identification of unusual data records that might be interesting, or data errors that require further investigation.
- Association rule learning (dependency modelling): searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
- Clustering: the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
- Classification: the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
- Regression: attempts to find a function which models the data with the least error.
- Summarization: providing a more compact representation of the data set, including visualization and report generation.
- Results validation
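The market basket idea above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method taken from the slides: the baskets are made-up data, and the helper names `frequent_pairs` and `confidence` and the 0.6 support threshold are assumptions chosen for the example. It counts how often each pair of products appears in the same basket (support) and how often one product appears given another (confidence).

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets (made-up data for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs found in at least
    a min_support fraction of all baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also
    contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


pairs = frequent_pairs(transactions, min_support=0.6)
for pair, support in sorted(pairs.items()):
    print(pair, support)

print("beer -> diapers:", confidence(transactions, "beer", "diapers"))
print("diapers -> beer:", confidence(transactions, "diapers", "beer"))
```

Counting every pair this way is fine for a toy data set; production association rule miners (for example the Apriori family) prune candidate item sets so that they scale to far larger baskets and catalogs.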

- Statistics: Naive Bayes, clustering (more than 25 years old)
- Machine learning: classification algorithms such as decision trees and neural networks (more than 20 years old)
- Database: association rule mining, data warehouse OLAP (more than 15 years old)
- Big data processing: the most current area, still evolving at a fast rate

- Databases: advanced modern databases and data processing strategies
- Big data processing with:
  - Parallel Data Warehouse and OLAP (Online Analytic Processing)
  - MapReduce/Hadoop-based MPP systems
- Statistics
- Data mining
  - Research from database: association rule mining
  - Research from statistics: clustering
  - Research from machine learning: neural networks
- And more on recent developments

- MPP systems
  - PDW-based systems: Oracle, IBM, Teradata, Microsoft PDW
  - In-memory NewSQL systems
  - Hadoop/MapReduce-based systems: NoSQL systems
    - MongoDB, Pig Latin, HBase, Hive, and many others
- Cloud: big data processing systems on cloud
  - Google Cloud, Amazon Cloud, Microsoft Azure, Oracle, IBM

http://blogs.the451group.com/opensource/2011/04/15/nosql-newsql-and-beyond-the-answer-to-sprained-relational-databases/

Major commercial tools:
- SAS Enterprise Miner
- Microsoft Business Intelligence: data analytic tool using databases

Popular free and open source tools:
- R / MapR: R is a programming language and software environment for statistical computing, data mining, and graphics; part of the GNU Project.
- Weka: a suite of machine learning software applications written in the Java programming language.
- UIMA (Unstructured Information Management Architecture): a component framework for analyzing unstructured content such as text, audio, and video, originally developed by IBM.

On databases:
- CIS 530: Database Concept and Modern Database Processing
- CIS 611: Advanced Data Processing Techniques in Parallel Data Warehouse (PDW) and OLAP

On big data processing and management systems:
- CIS 612: Big Data Processing Systems and Modern Database Programming (Hadoop and MapReduce, virtual machine (VM), cloud)
- CIS 695: Practicum in Data Analytics and Big Data Processing (scheduled to be created in Spring 2016)
- CIS 696: one more new course will be created on recent research

Data mining:
- CIS 660: Data Mining Techniques from Database, Statistics and Machine Learning
- EEC 525 Data Mining: Web Data Mining Techniques from Database
- CIS 667: Bioinformatics (possibly)

Math and Statistics: Graduate Certificate in Applied Predictive Modeling
- MTH 521: Time Series Analysis
- MTH 531: Categorical Data Analysis
- MTH 537: Operation Research
- MTH 567: Applied Linear Models I
- MTH 638: Operation Research II
- MTH 668: Applied Linear Models II
- MTH 675: Applied Multivariate Statistics

Business Analytic Certificates: focus on the SAS certificate with the SAS Enterprise Miner tool
- BUS 575: Introduction to Business Analytics
- BUS 600: Applied Business Analytics
- BUS 601: Managing Databases for Business Analytics
- BUS 602: Strategy for Business Analytics
- BUS 603: SAS for Data and Statistical Analysis
- BUS 604: Advanced Business Analytics I
- BUS 606: Practicum in Business Analytics

- Explorys by IBM (website: https://www.explorys.com/): data analytics and big data processing on health and wellness data; data analytics for Cleveland Clinic (Teradata PDW) and MetroHealth
- Progressive: big data processing on auto insurance, with Hadoop-based MPP systems
- PNC (Teradata MPP PDW): big data processing systems on financial data

- Hadoop Big Data Processing Workshop/Meetup: the EECS Department of CSU is planning to host the meeting annually to connect our students to local big data companies
- Data Scientist Group: regular webinar on advanced data analytics topics

Current research/publications at CSU (by Sunnie Chung):
- Research on the problems in developing MPP systems
- Research on integrating Big Data Management Systems (BDBMS), the most recent research trend
- Research on data mining for machine fault detection

- 10 out of 23 programs are master's degrees in business analytics
  - Limited to basic statistics and marketing/business-oriented data mining tools only (SAS, MS BI Data Analysis Tool)
- Data scientist oriented programs (typical East Coast theory-oriented programs: Columbia, NYU, DePaul, etc.)
  - Focus on predictive analysis skills (math and stats) and computational theory of machine learning algorithms
  - Lack practical data processing courses or big data system/cloud courses; not many such courses are available
- Good data analytics programs with a good balance of core subjects, analytic skills, and practicum
  - Northwestern University
  - Indiana University Bloomington
  - Carnegie Mellon

- MSIA 401 Statistical Methods for Data Mining
- MSIA 431 Analytics for Big Data
- MSIA 489 Industry Practicum
- MSIA 490-21 Predictive Models for Credit Risk Management
- MSIA 490-23 Healthcare Analytics
- MSIA 490-25 Intro to Java Programming
- MSIA 490-27 Social Networks Analysis
- MSIA 490 Intro to Databases & Information Retrieval
- MSIA 411 Data Visualization
- MSIA 420 Predictive Analytics
- MSIA 421 Data Mining
- MSIA 430 Introduction to Data Warehousing and Workflow Management
- MSIA 490-20 Text Analytics
- MSIA 490-20 Topics in Analytics with Python
- MSIA 440 Optimization and Heuristics

- Two-year Master of Data Science/Data Analytics, or hybrid: Master of Data Science and Computer Information Science
- Good balance of courses on core subjects:
  - Big Data Processing Application
  - Advanced Database
  - Advanced Algorithm
  - Statistics
  - Data Mining
  - Security in Network System
  - Information Visualization
  - Cloud Computing
- A variety of good related courses are available

MSIT Business Intelligence & Data Analytics Curriculum
Prerequisite: OOP programming courses and 3 years of working experience

Course #   Core Courses (60 units required)       Units
95-703     Database Management                    12
95-796     Statistics for IT Managers             6
95-710     Economic Analysis                      6
95-797     Data Warehousing                       6
94-806     Privacy in the Digital Age             6
95-868     Exploring and Visualizing Data         6
95-791     Data Mining                            6
95-852     Analytics and Business Intelligence    6
95-866     Advanced Business Analytics            6

30 credit hours in 2 years:
- CIS 530: Database Concept and Modern Database Processing
- CIS 611: Advanced Data Processing Techniques in Parallel Data Warehouse and OLAP
- CIS 612: Big Data Processing Systems and Information Retrieval (Hadoop and MapReduce, virtual machine (VM), cloud)
- CIS 695: Practicum in Data Analytics and Big Data Processing (in Spring 2016)
- CIS 660: Data Mining Techniques from Database, Statistics and Machine Learning
- EEC 525 Data Mining: Web Data Mining Techniques from Database
- CIS 660: Advanced Algorithm
- CIS 340: System Programming
- CIS 260: Java Programming
- CIS 675: Information Security
- EEC 693: Network Security and Privacy

Applied Predictive Modeling:
- MTH 531: Categorical Data Analysis
- MTH 567: Applied Linear Models I
- MTH 668: Applied Linear Models II
- MTH 675: Applied Multivariate Statistics
- BUS 603: SAS for Data and Statistical Analysis
- BUS 604: Advanced Business Analytics I
- BUS 606: Practicum in Business Analytics

Data Visualization