Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data



Similar documents
CSCI-599 DATA MINING AND STATISTICAL INFERENCE

Big Data and Analytics (Fall 2015)

CS Data Science and Visualization Spring 2016

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Big Data Analytics. Lucas Rego Drumond

Information Management course

Estimating PageRank Values of Wikipedia Articles using MapReduce

BIG DATA What it is and how to use?

The Internet of Things and Big Data: Intro

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Big Data: Opportunities & Challenges, Myths & Truths 資 料 來 源 : 台 大 廖 世 偉 教 授 課 程 資 料

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Challenges for Data Driven Systems

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

COMP9321 Web Application Engineering

Big Data and Analytics: Challenges and Opportunities

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Big Data Analytics. Genoveva Vargas-Solar French Council of Scientific Research, LIG & LAFMIA Labs

International Journal of Innovative Research in Computer and Communication Engineering

Office: LSK 5045 Begin subject: [ISOM3360]...

BUDT 758B-0501: Big Data Analytics (Fall 2015) Decisions, Operations & Information Technologies Robert H. Smith School of Business

Machine Learning for Data Science (CS4786) Lecture 1

Big Data Systems CS 5965/6965 FALL 2015

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

Advanced Big Data Analytics with R and Hadoop

Sources: Summary Data is exploding in volume, variety and velocity timely

Advanced In-Database Analytics

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Innovative, High-Density, Massively Scalable Packet Capture and Cyber Analytics Cluster for Enterprise Customers

Big Systems, Big Data

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

Infrastructures for big data

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

Big Data With Hadoop

Spark and the Big Data Library

BIG DATA TOOLS. Top 10 open source technologies for Big Data

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Customized Report- Big Data

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

RevoScaleR Speed and Scalability

Transforming the Telecoms Business using Big Data and Analytics

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

NextGen Infrastructure for Big DATA Analytics.

How To Use Big Data For Telco (For A Telco)

Fast Analytics on Big Data with H20

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Why is Internal Audit so Hard?

BIG DATA CHALLENGES AND PERSPECTIVES

FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Oracle Big Data SQL Technical Update

UNIVERSITY OF SOUTHERN CALIFORNIA Marshall School of Business BUAD 425 Data Analysis for Decision Making (Fall 2013) Syllabus

How Companies are! Using Spark

The 4 Pillars of Technosoft s Big Data Practice

class 1 welcome to CS265! BIG DATA SYSTEMS prof. Stratos Idreos

CS144R/244R Network Design Project on Software Defined Networking for Computing

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Data Warehousing in the Age of Big Data

Making Sense of Big Data in Insurance

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Machine Learning over Big Data

Big Data and Data Science: Behind the Buzz Words

The Data Engineer. Mike Tamir Chief Science Officer Galvanize. Steven Miller Global Leader Academic Programs IBM Analytics

Large-Scale Data Processing

Predictive Modeling and Big Data

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

2015 Workshops for Professors

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

Introduction to Hadoop

Keywords Big Data, NoSQL, Relational Databases, Decision Making using Big Data, Hadoop

Azure Machine Learning, SQL Data Mining and R

Towards a Thriving Data Economy: Open Data, Big Data, and Data Ecosystems

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

ANALYTICS CENTER LEARNING PROGRAM

Big Data Systems CS 5965/6965 FALL 2014

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Spatio-Temporal Networks:

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

BIG DATA FUNDAMENTALS

Client Overview. Engagement Situation. Key Requirements

Course Syllabus. Purposes of Course:

GigaSpaces Real-Time Analytics for Big Data

CISC 432/CMPE 432/CISC 832 Advanced Database Systems

Industrial and Systems Engineering Master of Science Program Data Analytics and Optimization

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Transcription:

CS535 Big Data W1.A.1 CS535 BIG DATA W1.A.2 Let the data speak to you Medication Adherence Score How likely people are to take their medication, based on: How long people have lived at the same address How long they ve been in the same job if they are married Whether they own a car Helping healthcare providers save money Car ownership vs. taking antibiotics? PART 0. INTRODUCTION 1. INTRODUCTION TO BIG DATA Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535 W1.A.3 Look Who s Peeking at Your Paycheck Experian s Income Insight Estimates people s income level Based on their credit history Trains the estimation model using selected credit history and tax information from the IRS W1.A.4 The Artemis project: Saving preemies using Big Data The Artemis project Dr. Carolyn McGregor Toronto s Hospital for Sick Children, University of Ontario Institute of Technology and IBM Captures and processes the patients data in real time 16 different data streams Heart rate, respiration rate, temperature, blood pressure and blood oxygen level Around 1,260 data points per second System detects subtle changes that may signal the onset of infection 24 hours before overt symptoms appear KAREN BLUMENTHAL, Look Who s Peeking at Your Paycheck, The Wall Street Journal, Jan. 13, 2010, http://www.wsj.com/articles SB10001424052748703672104574654211904801106 What is Big Data? W1.A.5 Achieving Small Miracles from Big Data, https://www.ibm.com/smarterplanet/global/files/ ca en_us healthcare smarter_healthcare_data_baby.pdf W1.A.6 Big Data Things one can do at a large scale that cannot be done at a smaller scale To extract new insights Create new forms of values Big Data is about analytics of huge quantities of data in order to infer probabilities Big Data is NOT about trying to teach a computer to think like humans Providing a quantitative dimension it never had before 1

W1.A.7 W1.A.8 The three(or four) Vs in Big Data Volume Voluminous It does not have to be certain number of petabytes, etc. Velocity How fast the data is coming in How fast you need to be able to analyze and utilize it Variety Number of sources or incoming vectors Veracity Can you trust the data itself, source of the data, or the process? User entry errors, redundancy, corruption of values Data cleaning Related research areas Storage systems How can we efficiently resolve queries on massive amounts of input data? The input dataset may be presented in the form of a distributed data stream Machine learning How can we efficiently solve large-scale machine learning problems? The input data may be massive; stored in a distributed cluster of machines Distributed computing How can we efficiently solve large-scale optimization problems in distributed computing environments? For example, how can we efficiently solve large-scale combinatorial problems, e.g. processing of large scale graphs? W1.A.9 W1.A.10 CS535 BIG DATA Goal of this course Understanding fundamental concepts in Big Data Analytics Learn about existing technologies and how to apply them PART 0. INTRODUCTION 2. COURSE INTRODUCTION Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535 W1.A.11 W1.A.12 Related courses When is the best time to take this course? Machine Learning/Statistics Distributed Systems Database Systems Computer Communications Computer Security 2

W1.A.13 W1.A.14 About me Research area Big Data Storage, retrievals, analytics and visualization Projects Glean Scalable analytics Galileo Distributed data storage system for large scale geospatial time-series datasets Mendel Scalable biological data storage system GeoLens Visual Analytics Communications Course Website http://www.cs.colostate.edu/~cs535 Announcements: Check the course website at least twice a week. Schedule (course materials, readings, assignments) Policies Canvas Assignment submission Discussion board Grades Contact Instructor sangmi@cs Office hour: T 11:00AM ~ noon and by appointment Office: CSB456 URL: http://www.cs.colostate.edu/~sangmi W1.A.15 W1.A.16 Course Structure Course Topics (1/2) A paradigm for Big Data; Lambda Architecture Distributed Computing Model for Scalable Batch Computing MapReduce Part 0 Introduction to Big Data Part 1 Batch Computing Models for Big Data Analytics Part 2 Scalable Frameworks for Big Data Analytics Distributed Data File System for Scalable Batch Computing Google File System Advanced Batch Computing at Scale with Analytics Algorithms Predictive analytics (Linear regression, logistic regression) Using Gradient descent for scalability Distance based clustering models K-Means, Canopy, Fuzzy k-means, Streaming k-means Tree models, Random Forest model, Expectation-Maximization model Social Network Analytics Feature selections (PCA, correlation based, entropy based) W1.A.17 W1.A.18 Course Topics (2/2) Grading Policy Frameworks for real-time analytics Case study with Apache Storm Scalable NoSQL storage system and algorithms Distributed Hash Table with a case study of Apache/Facebook Cassandra In-memory storage systems with a case study of Redis and Tachyon Framework for the graph data analytics With a case study of Google s Pregel Multi-framework model for large scale data management With a case study of Facebook s f4 Category Points Percentage of Final Grade Programming Assignments (20% x 2) 400 40 % Term Project 300 30 % Quizzes and participation 100 10 % Midterm Exam 200 20% 3

W1.A.19 W1.A.20 What will be the assignments? Programming Assignments Programming Assignment 1 Implementing PageRank algorithm over Wikipedia pages using MapReduce Due on 9/25 Programming Assignment 2 Implementing real-time Twitter stream analysis using Apache Storm Due on 11/6 W1.A.21 W1.A.22 How do I submit the assignments? All of the programming assignments are individual submissions Submission should be via Canvas Late policy for the assignment submissions Up to a maximum of 2 day past the deadline. 10% penalty per day will be applied (Please no waiver requests) Each student will provide a demo of the programming assignment in CSB120 Each assignment will account for 20% of total score for the course Term Project W1.A.23 W1.A.24 Objectives and Requirements of Term Project Students identify Big Data topics for their term project Problem should contain at least one Big Data challenge Students provide a methodology to solve their problem Solution should be scalable Students implement a software solution Software should be functional Students provide an evaluation scheme for their software The success of your project should be quantifiable Term Project Grading Deliverable 1 (Proposal): 30% Deliverable 2 (Final including report and software): 40% Deliverable 3 (Presentation): 30% Late policy for deliverable submissions Up to a maximum of 2 days past the deadline. 10% penalty per day will be applied 4

W1.A.25 W1.A.26 Deliverables of Term Project Deliverable 1 (due on 10/1). Term project proposal 30% of term project score Project proposal Topic Motivation Literature survey Methodology Timeline Evaluation method Team management Student presentations (10 minutes per team) 10/13 and 10/15 in the class Deliverables of Term Project Deliverable 2. Your software, demonstration, and final report Due on 11/30 40% of term project score Deliverable 3. Presentation 30% of term project score You are strongly recommended to work on the term project as a team (2 or 3 team members) However, under rare circumstances you are allowed to work as an individual Please contact me and explain why you wish to work alone Please email your team by August 28th. W1.A.27 W1.A.28 Step 1. Which Big Data problems interest you? Example Area A Real-time Big Data Analytics Are you interested in a streaming data analysis? Are you interested in Twitter message analysis? Example Area B Iterative Data Analytics Are you interested in graph data analysis? Are you interested in social network analysis? Example Area C Predictive Data Analytics Are you interested in predicting values? Are you interested in recommendation systems? Example Area D Visual Data Analytics Are you interested in visualizing very large datasets? Step 2. Is there any other student(s) to work with? Find one or two You can choose one or combine two of the above areas W1.A.29 W1.A.30 Step 3. Narrow down your topic The more specific the problem is, the better the chance for completion and success Step 4. Is it feasible? Do you have a relevant dataset? Do you have functional tools or environment to support the needed analysis? Is the workload reasonable? 5

W1.A.31 W1.A.32 Previous Term Projects in CS535 Supporting Emergency Response During Natural Disasters with Twitter Data Winning Words in U.S. Supreme Court legal arguments Efficient Sequence Alignment and Similarity Searches Processing Smart Grid Data In Real Time (DEBS grand challenge 2014) Previous Term Projects in CS535 Time to Answer for Questions on stackoverflow.com using Map Reduce Analysis of words for spell-checking in search queries using digitized books and articles Efficient Boolean Symmetric Searchable Encryption Big Data I/O Performance Improvement Using a Buffered B-Tree Algorithm Node and Metadata Visualization in a Distributed Hash Table Who is Building Wikipedia? W1.A.33 W1.A.34 Quizzes and Participation There will be 10+ pop quizzes in class 1-2 simple questions Open-book, open-material, not open-internet Quizzes and Participation Minute survey W1.A.35 W1.A.36 Midterm 11/19/2015 in class 20% of the course grade Midterm Exam 6

W1.A.37 W1.A.38 Final Letter Grade Final Letter Grades and Course Policy Letter Grade Total Percentage A 90.00 % and higher B 80.00 ~ 89.99 % C 70.00 ~ 79.99 % D 60.00 ~ 69.99 % F Below 60.00 % W1.A.39 W1.A.40 Course Policy No make-up for missed quizzes and exams Except for the case where student provided an advance written notice to the instructor based on an emergency Supporting paper work must be submitted with the request Two lowest quiz scores will be dropped at the end of the semester Course Policy No Cell-phones in the class. No Laptops in the class. If you need to use a laptop during lectures, please sit in the back row. I will ask you to turn off your laptop if needed. W1.A.41 W1.A.42 More importantly, Attend the class, ask questions, and discuss Check the course web page and RamCT regularly Try new technologies and apply them Share your experiences with other students in class Questions? 7