Big Data Analytics Building Blocks. Simple Data Storage (SQLite)

Similar documents
Big Data Analytics Building Blocks; Simple Data Storage (SQLite)

Big Data Analytics Building Blocks. Simple Data Storage (SQLite)

Big Data Analytics Building Blocks; Simple Data Storage (SQLite)

Big Data Analytics Process & Building Blocks

Scaling Up 2 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. HBase, Hive

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

Scaling Up HBase, Hive, Pegasus

Attend Part 1 (2-3pm) to get 1 point extra credit. Polo will announce on Piazza options for DL students.

Text Analytics (Text Mining)

70% 92% Managing Your Online Reputation. Review Sites. Social. Mobile 18/01/2013. Facebook has more than 1 billion active users

REAL ESTATE TECH TRENDS

CIS 4930/6930 Spring 2014 Introduction to Data Science Data Intensive Computing. University of Florida, CISE Department Prof.

Reducing Usage on a Service Plan

Google Product. Google Module 1

CSE 6040 Computing for Data Analytics: Methods and Tools. Lecture 1 Course Overview

Your Business s Online Check-Up

Cut The TV Cable. Paul Glattstein

Shafiq Khan. An Introduction to. Cloud Computing 13/12/2012

The Age of BYOD A study of personal content streaming vs. video-on-demand for the hospitality industry

Impressive Analytics

How to use the Cloud

media kit 2014 PUBLISH / DEVELOP Global Mobile Ad Network

Building HTML5 and hybrid mobile apps using cloud services. Andrei Glazunov

App Reputation Report February 2013 The Authority in App Security

Search Related Trust Incentives in Online Marketplaces

social media - mobile media MediaPro Burgdorf / Olli Alm Olli.Alm@metropolia.fi

INTRODUCTION TO THE WEB

What is your age? Are you? Male Female

Wireless Presentation Gateway. User Guide

COMP9321 Web Application Engineering

Impact Level HootSuite does not yet have an Impact Level accreditation, however if we were to apply we believe we would be at the IL3 level.

Clarity High School Student Survey

Computer Forensics Application. ebay-uab Collaborative Research: Product Image Analysis for Authorship Identification

The Need for Training in Big Data: Experiences and Case Studies

Online Marketing Module COMP. Certified Online Marketing Professional. v2.0

Search Engine Optimization Report (BTW 490: Workplace Writing in a Digital Age)

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline. We ll learn: Faloutsos CMU SCS

Social Media Marketing UCSB Extension

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Enter Here -->>> App Store Tracking, Track your Rankings - AppStoreShark.com Scam or Work? Visit Here

Design and Development, changed forever for now. Mobile First

Teenagers and Young Adults, aged 15-29, throughout the United States

U.S. Mobile Benchmark Report

HOW TO IMPROVE YOUR WEBSITE FOR BETTER LEAD GENERATION Four steps to auditing your site & planning improvements

Developing and deploying mobile apps

Guest Quick Guide PC and Mac Users Updated to version March 2015

A 10 MINUTE OVERVIEW OF KEY FEATURES FOR EVENT MOBILE APPS.

Big Data and Analytics: Challenges and Opportunities

Using Social Media to Improve Your SEO

beyond borders... Mobility Services

Information Management course

Mobile Learning Apps. Distance Learning (860) Founders 131/131A Middlesex Community College

RESTful or RESTless Current State of Today's Top Web APIs

Website, Blogs, Social Sites : Create web presence in the world of Internet rcchak@gmail.com, June 21, 2015.

STATE OF THE MEDIA: CONSUMER USAGE REPORT

Entrepreneurship Center

Digital marketing strategy

Grow Your Business wi w t i h a a Mobil i e l A p A p

Introduction to Engineering Using Robotics Experiments Lecture 17 Big Data

Online Marketing Strategies

CUSTOM MOBILE APP DEVELOPMENT & LICENSING

Clarity Middle School Survey

Mobile App Framework For any Website

Business Challenges and Research Directions of Management Analytics in the Big Data Era

PressRelease For Immediate Release

ONLINE ACCOUNTABILITY FOR EVERY DEVICE. Quick Reference Guide V1.0

Wi-Fi: MIL_Guest Password: Volkswag3n. Gregg Burkhalter Digital Marketing Consultant

NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks

Title/Description/Keywords & Various Other Meta Tags Development

SEO, Search Engine and Online Reputation Management

Brand Queries As A Ranking Signal

Millennial Teens: Non-Conformist Trendsetters

Web 2.0 Technology Overview. Lecture 8 GSL Peru 2014

MY DIGITAL PLAN MY DIGITAL PLAN BROCHURE

Harnessing Digital. November 2014

INTRODUCTION TO WEB TECHNOLOGY

How Using Big Data in Security Helps (and Hurts) Us

More information >>> HERE <<<

Pre- launch Buzz Makes or Breaks Your App s Success

The Best Mobile App Development Platform. Period.

2010 Brightcove, Inc. and TubeMogul, Inc Page 2

ITP 140 Mobile Technologies. Mobile Topics

The State Of Mobile Apps

CUSTOM MOBILE APP DEVELOPMENT & LICENSING

Searching the Social Network: Future of Internet Search?

Minimum Requirements for Web Based Applications

About Blue Sky Sessions

Getting Started with Oracle Data Miner 11g R2. Brendan Tierney

DEVELOPING A COMPREHENSIVE E-MARKETING STRATEGY USING 3 POPULAR ONLINE CHANNELS

CIO SURVEY. Sponsored by. The Education Industry in 60 seconds

Masters Series: The New Demands of Online Video

Introduction to Web Science

Online Marketing for Franchisees One Less Thing To Worry About

120 Broadus Ave, Greenville, SC (864) DOM360.com. The Composition of a

2. The New Economy. Outline. Marketing Challenges. Connecting with Customers Directly

Social Intelligence Report Adobe Digital Index Q2 2015

MARKETING KUNG FU SEO: Key Things to Expand Your Digital Footprint. A Practical Checklist

MIS & CS -WORKING TOGETHER TO DEVELOP MOBILE APPS

How To Protect Your Network From Threats From Your Network (For A Mobile) And From Your Customers (For An Enterprise)

Doing Multidisciplinary Research in Data Science

Transcription:

http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Big Data Analytics Building Blocks. Simple Data Storage (SQLite) Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

What is Data & Visual Analytics? 2

What is Data & Visual Analytics? No formal definition! 2

What is Data & Visual Analytics? No formal definition! Polo s definition: the interdisciplinary science of combining computation techniques and interactive visualization to transform and model data to aid discovery, decision making, etc. 2

What are the ingredients? 3

What are the ingredients? Need to worry (a lot) about: storage, complex system design, scalability of algorithms, visualization techniques, interaction techniques, statistical tests, etc. Used to be simpler before this big data era. Why? 3

What is big data? Why care? ( big data is buzz word, so is IoT - Internet of Things)

(Fall 14) What is big data? Why care? Many companies businesses are based on big data (Google, Facebook, Amazon, Apple, Symantec, LinkedIn, and many more) Web search Rank webpages (PageRank algorithm) Predict what you re going to type Advertisement (e.g., on Facebook) Infer users interest; show relevant ads Infer what you like, based on what your friends like Recommendation systems (e.g., Netflix, Pandora, Amazon) Online education Health IT: patient records (EMR) Bio and Chemical modeling: Finance Cybersecruity Internet of Things (IoT)

Good news! Many big data jobs What jobs are hot? Data scientist Emphasize breadth of knowledge This course helps you learn some important skills

Big data analytics process and building blocks

Collection Cleaning Integration Analysis Visualization Presentation Dissemination

Building blocks, not steps Collection Cleaning Integration Analysis Visualization Presentation Dissemination Can skip some Can go back (two-way street) Examples Data types inform visualization design Data informs choice of algorithms Visualization informs data cleaning (dirty data) Visualization informs algorithm design (user finds that results don t make sense)

How big data affects the process? Collection Cleaning Integration Analysis Visualization Presentation Dissemination The 4V of big data (now 7Vs) Volume: billions, petabytes are common Velocity: think Twitter, fraud detection, etc. Variety: text (webpages), video (e.g., youtube), etc. Veracity: uncertainty of data http://www.ibmbigdatahub.com/infographic/four-vs-big-data http://dataconomy.com/seven-vs-big-data/

Schedule Collection Cleaning Integration Analysis Visualization Presentation Dissemination

Two analytics examples

NetProbe: Fraud Detection in Online Auction WWW 2007 NetProbe http://www.cs.cmu.edu/~dchau/papers/p201-pandit.pdf

NetProbe: The Problem Find bad sellers (fraudsters) on ebay who don t deliver their items $$$ Seller Buyer Auction fraud is #3 online crime in 2010 source: www.ic3.gov 14

15

NetProbe: Key Ideas Fraudsters fabricate their reputation by trading with their accomplices Fake transactions form near bipartite cores How to detect them? 16

NetProbe: Key Ideas Use Belief Propagation F A H Fraudster Accomplice Darker means more likely Honest 17

NetProbe: Main Results 18

19

19

Belgian Police 19

20

What analytics process does NetProbe go through? Collection Scraping (built a scraper / crawler ) Cleaning Integration Analysis Design detection algorithm Visualization Presentation Dissemination Paper, talks, lectures Not released

Discovr app (iphone & ipad) https://itunes.apple.com/us/app/discovr-discover-new-music/id412768094?mt=8

What analytics process would you go through to build the app? Collection Cleaning Integration Analysis Visualization Presentation Dissemination

Homework 1 (out next week) Collection Cleaning Integration Analysis Visualization Presentation Dissemination Simple End-to-end analysis Collect data from Rotten Tomatoes (using API) Movies (Actors, directors, related movies, etc.) Store in SQLite database Transform data to movie-movie network Analyze, using SQL queries (e.g., create graph s degree distribution) Visualize, using Gephi Describe your discoveries

Data Collection. How? Download Low effort API Scrape/Crawl High effort 26

Data you can just download Yahoo Finance (csv) StackOverflow (xml) Yahoo Music (KDD cup) Atlanta crime data (csv) Soccer statistics More on course website: http://poloclub.gatech.edu/cse6242/2015fall/#datasets 27

Data via API Twitter (small subset) https://dev.twitter.com/streaming/overview Last.fm (Pandora has unofficial API) Flickr Facebook (your friends only) CrunchBase (database about companies) Rotten Tomatoes not free anymore :-( itunes More on course website: http://poloclub.gatech.edu/cse6242/2015fall/#datasets 28

Data that needs scraping Amazon (reviews, product info) ESPN Google Scholar ebay Google Play More on course website: http://poloclub.gatech.edu/cse6242/2015fall/#datasets 29

How to Scrape? Google Play example Goal: build network of similar apps 30

Most popular embedded database in the world iphone (ios), Android, Chrome (browsers), Mac, etc. Self-contained: one file contains data + schema Serverless: database right on your computer Zero-configuration: no need to set up! http://www.sqlite.org http://www.sqlite.org/different.html 31

How does it work? >sqlite3 database.db sqlite> create table student(ssn integer, name text); sqlite>.schema CREATE TABLE student(ssn integer, name text); ssn name 32

How does it work? insert into student values(111, "Smith"); insert into student values(222, "Johnson"); insert into student values(333, "Obama"); select * from student; ssn name 111 Smith 222 Johnson 333 Obama 33

How does it work? create table takes (ssn integer, course_id integer, grade integer); ssn course_id grade 34

How does it work? More than one tables - joins E.g., create roster for this course ssn name ssn course_id grade 111 Smith 222 Johnson 333 Obama 111 6242 100 222 6242 90 222 4000 80 35

How does it work? select name from student, takes where student.ssn = takes.ssn and takes.course_id = 6242; ssn name ssn course_id grade 111 Smith 222 Johnson 333 Obama 111 6242 100 222 6242 90 222 4000 80 36

SQL General Form select a1, a2,... an from t1, t2,... tm where predicate [order by...] [group by...] [having...] 37

Find ssn and GPA for each student select ssn, avg(grade) from takes group by ssn; ssn course_id grade ssn avg(grade) 111 6242 100 222 6242 90 111 100 222 85 222 4000 80 38

What if slow? Build an index to speed things up. SQLite s indices use B-tree data structure. O(logN) speed for adding/finding/deleting an item create index student_ssn_index on student(ssn); 39