Big Data Analytics Building Blocks; Simple Data Storage (SQLite)

Similar documents
Big Data Analytics Building Blocks. Simple Data Storage (SQLite)

Big Data Analytics Building Blocks. Simple Data Storage (SQLite)

Big Data Analytics Process & Building Blocks

Scaling Up 2 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. HBase, Hive

Scaling Up HBase, Hive, Pegasus

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

Text Analytics (Text Mining)

CIS 4930/6930 Spring 2014 Introduction to Data Science Data Intensive Computing. University of Florida, CISE Department Prof.

CSE 6040 Computing for Data Analytics: Methods and Tools. Lecture 1 Course Overview

Reducing Usage on a Service Plan

Google Product. Google Module 1

COMP9321 Web Application Engineering

REAL ESTATE TECH TRENDS

Building HTML5 and hybrid mobile apps using cloud services. Andrei Glazunov

The Need for Training in Big Data: Experiences and Case Studies

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline. We ll learn: Faloutsos CMU SCS

How to use the Cloud

RESTful or RESTless Current State of Today's Top Web APIs

HOW TO IMPROVE YOUR WEBSITE FOR BETTER LEAD GENERATION Four steps to auditing your site & planning improvements

CSC590: Selected Topics BIG DATA & DATA MINING. Lecture 2 Feb 12, 2014 Dr. Esam A. Alwagait

Big Data and Analytics: Challenges and Opportunities

Information Management course

Big Data Spatial Analytics An Introduction

Mobile App Framework For any Website

Computer Forensics Application. ebay-uab Collaborative Research: Product Image Analysis for Authorship Identification

Cut The TV Cable. Paul Glattstein

Shafiq Khan. An Introduction to. Cloud Computing 13/12/2012

Wireless Presentation Gateway. User Guide

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Searching the Social Network: Future of Internet Search?

Millennial Teens: Non-Conformist Trendsetters

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

ITP 140 Mobile Technologies. Mobile Topics

INTRODUCTION TO THE WEB

What is your age? Are you? Male Female

The Age of BYOD A study of personal content streaming vs. video-on-demand for the hospitality industry

Mobile Learning Apps. Distance Learning (860) Founders 131/131A Middlesex Community College

Social Media Marketing UCSB Extension

THEODORA TITONIS VERACODE Vice President Mobile

CSE4334/5334 Data Mining Lecturer 2: Introduction to Data Mining. Chengkai Li University of Texas at Arlington Spring 2016

ONLINE ACCOUNTABILITY FOR EVERY DEVICE. Quick Reference Guide V1.0

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Clarity High School Student Survey

Impressive Analytics

MARKETING KUNG FU SEO: Key Things to Expand Your Digital Footprint. A Practical Checklist

White Paper: Datameer s User-Focused Big Data Solutions

Big Data in Telco & Banking Analytics. Benjamin Sznajder IBM Research Haifa

SEO, Search Engine and Online Reputation Management

ITEC 101 Introduction to Information Technology

How Using Big Data in Security Helps (and Hurts) Us

The Best Mobile App Development Platform. Period.

media kit 2014 PUBLISH / DEVELOP Global Mobile Ad Network

Developing and deploying mobile apps

2010 Brightcove, Inc. and TubeMogul, Inc Page 2

#mstrworld. No Data Left behind: 20+ new data sources with new data preparation in MicroStrategy 10

Step One Create a featured listing landing page on your website. Step Two Name your photos of the listing with the full property address

STATE OF THE MEDIA: CONSUMER USAGE REPORT

Web review of yoast.com

Getting Started with Oracle Data Miner 11g R2. Brendan Tierney

Website, Blogs, Social Sites : Create web presence in the world of Internet rcchak@gmail.com, June 21, 2015.

CUSTOM MOBILE APP DEVELOPMENT & LICENSING

Using HP ArcSight API for data visualization

Business Challenges and Research Directions of Management Analytics in the Big Data Era

Cloud Computing. Chapter 2 Software as a Service (SaaS)

Grow Your Business wi w t i h a a Mobil i e l A p A p

Introduction to Web Science

Digital marketing strategy

Web 2.0 Technology Overview. Lecture 8 GSL Peru 2014

Online Marketing Module COMP. Certified Online Marketing Professional. v2.0

Doing Multidisciplinary Research in Data Science

NetDania. Platforms and Services by. NetStation The flexible and feature-rich cloud trading platform

Big Data: Tools and Technologies in Big Data

Online Marketing Strategies

Social Influence Analysis in Social Networking Big Data: Opportunities and Challenges. Presenter: Sancheng Peng Zhaoqing University

Enter Here -->>> App Store Tracking, Track your Rankings - AppStoreShark.com Scam or Work? Visit Here

DIGITAL MARKETING. The Page Title Meta Descriptions & Meta Keywords

How To Use Big Data For Business

Guest Quick Guide PC and Mac Users Updated to version March 2015

Transcription:

Big Data Analytics Building Blocks; Simple Data Storage (SQLite) Duen Horng (Polo) Chau Georgia Tech CSE6242 / CX4242 Jan 9, 2014 Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

What is Data & Visual Analytics? 2

What is Data & Visual Analytics? No formal definition! 2

What is Data & Visual Analytics? No formal definition! Polo s definition: the interdisciplinary science of combining computation techniques and interactive visualization to transform and model data to aid discovery, decision making, etc. 2

What are the ingredients? 3

What are the ingredients? Need to worry (a lot) about storage, complex system design, scalability of algorithms, visualization techniques, etc.# # Used to be simpler before the big data era (why?) 3

What is big data? Why care? Many companies businesses are based on big data (Google, Facebook, Amazon, Apple, Symantec, LinkedIn, and many more)# Web search" Rank webpages (PageRank algorithm)# Predict what you re going to type# Advertisement (e.g., on Facebook)# Infer users interest; show relevant ads# Infer what you like, based on what your friends like# Recommendation systems (e.g., Netflix, Pandora, Amazon)# Online education# Health IT: patient records (EMR)# Bio and Chemical modeling: # Finance # Cybersecruity

Good news! Many big data jobs What jobs are hot? # Data scientist " Emphasize breadth of knowledge# This course helps you learn some of the skills

Big data analytics process and building blocks

Collection Cleaning Integration Analysis Visualization Presentation Dissemination

Process, not steps Collection Cleaning Integration Analysis Visualization Presentation Dissemination Can skip some" Can go back (two-way street)" Examples# Data types inform visualization design# Data informs choice of algorithms# Visualization informs data cleaning (dirty data)# Visualization informs algorithm design (user finds that results don t make sense)

How big data affects the process? Collection Cleaning Integration Analysis Visualization Presentation Dissemination The 4V of big data# Volume: billions, petabytes are common# Velocity: think Twitter, fraud detection, etc.# Variety: text (webpages), video (e.g., youtube), etc.# Veracity: uncertainty of data http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Schedule Collection Cleaning Integration Analysis Visualization Presentation Dissemination

Two example analytics processes

NetProbe: Fraud Detection in Online Auction# WWW 2007 NetProbe http://www.cs.cmu.edu/~dchau/papers/p201-pandit.pdf

NetProbe: The Problem Find bad sellers (fraudsters) on ebay who don t deliver their items $$$ Seller Buyer Auction fraud is #3 online crime in 2010 source: www.ic3.gov!13

!14

NetProbe: Key Ideas Fraudsters fabricate their reputation by trading with their accomplices# Fake transactions form near bipartite cores# How to detect them?!15

NetProbe: Key Ideas Use Belief Propagation F A H Fraudster Accomplice Darker means more likely Honest!16

NetProbe: Main Results!17

!18

!18

Belgian Police!18

!19

What analytics process does NetProbe go through? Collection Scraping Cleaning Integration Analysis Design detection algorithm Visualization Presentation Dissemination Paper, talks, lectures Not released

Discovr movie app

What analytics process would you go through to build the app? Collection Cleaning IMDB, Rotten tomatoes, youtube May have duplicate trailers Integration Analysis Determine which movies are related Visualization Presentation Dissemination Mac app, ios app

Homework 1 (out next Tue) Collection Cleaning Integration Analysis Visualization Presentation Dissemination Simple End-to-end analysis Collect data from Rotten Tomatoes (using API) Movies (Actors, directors, related movies, etc.) Store in SQLite database Transform data to movie-movie network Analyze, using SQL queries (e.g., create graph s degree distribution) Visualize, using Gephi Describe your discoveries

Data Collection, Simple Storage (SQLite) & Cleaning

Today: Data Collection, Simple Storage (SQLite) & Cleaning How to get data?# Low effort Download (where?)# API# Scrape/Crawl, or from equipment (e.g., sensors) High effort 26

Data you can just download Yahoo Finance (csv)# StackOverflow (xml)# Yahoo Music (KDD cup)# Atlanta crime data (csv)# Soccer statistics 27

Data via API CrunchBase (database about companies) - JSON# Twitter# Last.fm (Pandora has API?)# Flickr# Facebook# Rotten Tomatoes# itunes 28

Data that needs scraping Amazon (reviews, product info)# ESPN# Google Scholar# (ebay?) 29

Most popular embedded database in the world# iphone (ios), Android, Chrome (browsers), Mac, etc.# Self-contained: one file contains data + schema# Serverless: database right on your computer# Zero-configuration: no need to set up!# # http://www.sqlite.org# http://www.sqlite.org/different.html 30

How does it work? >sqlite3 database.db! # sqlite> create table student(ssn integer, name text);! sqlite>.schema! CREATE TABLE student(ssn integer, name text); ssn name 31

How does it work? insert into student values(111, "Smith");! insert into student values(222, "Johnson");! insert into student values(333, "Obama");! select * from student;! ssn name 111 Smith 222 Johnson 333 Obama 32

How does it work? create table takes (ssn integer, course_id integer, grade integer); ssn course_id grade 33

How does it work? More than one tables - joins# E.g., create roster for this course ssn name ssn course_id grade 111 Smith 222 Johnson 333 Obama 111 6242 100 222 6242 90 222 4000 80 34

How does it work? select name from student, takes where student.ssn = takes.ssn and takes.course_id = 6242;! ssn name ssn course_id grade 111 Smith 222 Johnson 333 Obama 111 6242 100 222 6242 90 222 4000 80 35

SQL General Form select a1, a2,... an from t1, t2,... tm where predicate [order by...] [group by...] [having...]! 36

Find ssn and GPA for each student select ssn, avg(grade) from takes group by ssn; ssn course_id grade ssn avg(grade) 111 6242 100 222 6242 90 111 100 222 85 222 4000 80 37

What if slow? Build an index to speed things up. SQLite implements B-tree. Speed improves from O(N) if to do a sequential scan to O(logN).# create index student_ssn_index on student(ssn); 38

Homework 1 Write simple scripts to import Rotten Tomatoes data into SQLite, and do some simple queries. http://developer.rottentomatoes.com/docs/read/json/v10/movie_info 39