2 Today
- General course overview
- Q&A
- Introduction to Big Data
- Data Collection
- Assignment #1
3 General Course Information
- Course web page (also on Canvas)
- Text & references: no official textbook; we will use online resources and papers
- Apt/CHPC accounts: request ASAP
- TA: Anusha Buchireddygari
- Office hours
4 Class Interaction
- I strongly encourage discussion and interaction
- Extra credit for strong participation
5 Assignments & Projects
- Assignment 1: 10% credit, due in 1 week
- Assignment 2: 15% credit, 3 weeks
- Assignment 3: 15% credit, 2 weeks
- Assignment 4: 15% credit, 2 weeks
- Assignment 5: 15% credit, 2 weeks
- Final Project: 30% credit, 6 weeks
  - Presentations during exam week
- 3 parts: easy, medium, real-world problem
- Groups of 2
6 Things you should know
- No formal prerequisite
- Good programming skills (any language)
  - C/C++/Java preferred for assignments (except Assignment 1)
  - Make a case if you would like to use any other language
- Be prepared to learn new programming tools & techniques
  - Git: create an account on GitHub
- Sequential algorithms, complexity analysis
7 This is a systems course
- Big data is a broad concept that covers several aspects of computer science
- We shall focus on the computer systems aspect, e.g.:
  - How are the various parts of a big data computer system (hardware, software, and applications) put together?
  - What are the appropriate approaches to achieving high performance, scalability, reliability, and security in practical big data systems?
- This is not the right course if you are expecting to learn about algorithm design and theoretical foundations for machine learning & data mining
8 What will we cover?
- Data Collection
- Parallel Algorithms
- Hadoop & MapReduce
- Analytics beyond Hadoop
  - Spark
  - Graph algorithms: GraphLab
  - Real-time analytics: Storm
  - MPI
  - Other solutions
- Data Storage
- Data Security, Privacy
- Data Visualization
- Big Data Applications
9 Things that you need to do now
- Get a GitHub account and join UtahCS6965
- Request a CHPC account (send the request to the TA with your uid)
- Use git; commit often
10 Where to look for help
- Course website / Canvas
- TA office hours: MEB 3XXX, Mon & Thu 3-5pm
- My office hours: MEB 3454, Tue 12-2pm
- Questions?
11 What is Big Data?
13 What is Big Data?
- Volume
- Variety
- Veracity
- Velocity
14 The recognition that data is at the center of our digital world and that there are big challenges in collecting, storing, processing, analyzing, and making use of such data.
What counts as "big" depends on the application domain.
15 Kinds of Data
- Web data & web data accesses: emails, chats, tweets
- Telephone data
- Public databases: gene banks, census records
- Private databases: medical records, credit card transactions
- Sensor data: camera surveillance, wearable sensors, seismograms
- Byproducts of computer systems operations: power signal, CPU events
16 Data is valuable
- Google & Facebook earn advertising revenue by mining user data
- Financial firms analyze financial records, real-time transactions, and current events for profitable trades
- Medical records can be processed for better and potentially cheaper health care
- Roadways are monitored for traffic analysis and control
- Face detection in airports
17 Collection of Big Data
- The amount of data available is increasing exponentially
- But it is still challenging to collect
  - Difficult to get access
  - Redundancy
  - Noise
- Example: web data collection (Assignment 1)
18 Processing & Analysis of Big Data
- Large datasets do not fit in memory
- Processing large datasets is time consuming
- Parallel processing is necessary but challenging
- Message Passing Interface (MPI) (1991)
  - Low(er)-level C/Fortran API for communication
  - Powerful, hard(er) to code, not fault tolerant
- MapReduce, Google (2004)
  - Originated from web data processing
  - Ease of programming, fault tolerant
  - Limited semantics
- Spark, GraphLab, Storm, and others
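To make the "ease of programming, limited semantics" trade-off concrete, here is a minimal single-process sketch of the MapReduce programming model, using word count as the example. It is not Hadoop's or Google's API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for illustration, and everything runs in one Python process.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = []
    # Each map call depends only on its own document, which is what lets a
    # real framework spread the calls over many machines.
    for doc in documents:
        pairs.extend(map_phase(doc))
    # Each reduce call depends only on one key's values, so the reduces
    # are independent of each other as well.
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The programmer writes only the map and reduce functions; the grouping step in between is fixed by the model, which is where the "limited semantics" comes from.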
19 Storage & I/O
- Storage and I/O are critical for big data performance and reliability
- Hardware: disks, flash, SSD, nonvolatile memory, 3D memory
- Parallelism: RAID, parallel data storage, DFS
- Data durability and consistency
20 Data Centers
- Racks of machines & storage
- Data is growing faster than storage
- Energy consumption & cooling are big concerns
- Being built in colder areas and closer to cheap(er) energy sources
21 Energy
- Energy efficiency in data centers
  - Huge financial and environmental issue
- Data center construction from low-power computers
  - Think of a stack of tablets
  - Final project
- Data centers on renewable energy
  - Hydro, wind, solar
- Stampede cooling
22 Data privacy & protection
- Misuse of big data is a big concern
- A person's online activities can reveal all aspects of the person's life
- Systems need to provide clear guidelines on data privacy and protection
  - Sensitive clinical information
- Understand how the big data world operates
  - as a user
  - as a developer
23 Data Collection
24 Challenges of Big Data Collection
- Challenging to acquire a lot of data
  - Resources: memory, bandwidth, processing power
  - Parallelism
- What is useful?
  - Filter while acquiring
  - Identify redundancy
  - Collect topic-specific data
- Challenging to collect from distributed sources
25 Web Crawling
- Collect published web content
  - Start with a representative page
  - Parse content and follow hyperlinks
  - Recurse
- Why?
  - Search engines
  - Advertisements
  - Extract additional information beyond what is provided by search engines
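The start / parse / follow / recurse loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the recursion is written as an explicit queue (so pages are visited breadth-first), the `fetch` argument is a hypothetical callable that returns a page's HTML or `None`, and `FAKE_WEB` is a made-up three-page site that stands in for the network so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Crawl outward from seed; fetch(url) returns HTML text or None."""
    seen = {seed}            # URLs already queued, so none is fetched twice
    queue = deque([seed])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:     # page missing or unreadable: skip it
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical site used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.com/b": '<a href="/missing">gone</a>',
}

pages = crawl("http://example.com/", FAKE_WEB.get)
```

Swapping `FAKE_WEB.get` for a function that performs a real HTTP request turns the sketch into a working, if impolite, crawler; the later slides on redundancy and etiquette address what is still missing.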
26 Goals
- Collect good-quality pages (content)
- Filter on the fly (low content, redundancy)
- Parallel
- Crawl efficiently (low client resources)
- Crawl quietly (low server resources)
- Avoid parsing the same page more than once
- Predict page quality before retrieval
27 Redundancy removal
- Avoid parsing & following the same page more than once
  - Record all URLs that have been parsed & followed
- The same page might have different URLs
  - URL normalization
    - Host portion is case insensitive
    - Resolve percent encoding
- URLs that look completely different may also be the same page
  - Compute and match page content checksums
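A possible sketch of both checks, using only the Python standard library. The slide names two normalization rules (case-insensitive host, percent encoding); dropping the fragment, dropping the default port, and collapsing whitespace before hashing are extra assumptions made here, and the names `normalize_url`, `content_checksum`, and `DuplicateFilter` are illustrative. Userinfo in the URL is discarded by this simplified version.

```python
import hashlib
import re
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "abcdefghijklmnopqrstuvwxyz0123456789-._~")
DEFAULT_PORTS = {"http": 80, "https": 443}


def _decode_unreserved(match):
    ch = chr(int(match.group(1), 16))
    # Decode only characters that never need escaping; for the rest,
    # just upper-case the hex digits so %2f and %2F compare equal.
    return ch if ch in UNRESERVED else match.group(0).upper()


def normalize_url(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()   # host is case insensitive
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host                        # assumption: drop default port
    else:
        netloc = f"{host}:{port}"
    path = re.sub(r"%([0-9A-Fa-f]{2})", _decode_unreserved, parts.path) or "/"
    # Assumption: the fragment never changes which page is served.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def content_checksum(html):
    # Assumption: whitespace-only differences should not count as new content.
    canonical = " ".join(html.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Remembers normalized URLs and content checksums already seen."""

    def __init__(self):
        self.urls = set()
        self.checksums = set()

    def seen_url(self, url):
        key = normalize_url(url)
        if key in self.urls:
            return True
        self.urls.add(key)
        return False

    def seen_content(self, html):
        digest = content_checksum(html)
        if digest in self.checksums:
            return True
        self.checksums.add(digest)
        return False
```

The URL check runs before a page is fetched and costs nothing on the network; the checksum check can only run after the page has been downloaded, so it saves parsing and link-following rather than bandwidth.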
28 Other considerations
- Assume crawling is limited by some resource
  - Total time
  - Bandwidth / local storage
- Depth-first search or breadth-first search
- Hyperlinks with a high probability of pointing to a high-quality page
  - High in-links
  - Pointed to from high-quality pages (BFS)
- Spam identification
- Topic-specific crawling
  - Infer topic from text surrounding the link, keywords
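One way to act on the "high in-links" hint when resources are limited is to replace the plain FIFO queue with a priority queue keyed on how many links to each URL have been discovered so far. The sketch below is one possible design, not a prescribed one; `InLinkFrontier` and its methods are made-up names, and the in-link count is only what this crawl has observed, not a global measure of page quality.

```python
import heapq


class InLinkFrontier:
    """Hands out the unvisited URL with the most in-links seen so far."""

    def __init__(self):
        self.in_links = {}    # url -> number of links discovered so far
        self.heap = []        # entries: (-count, insertion order, url)
        self.visited = set()
        self.counter = 0

    def add_link(self, url):
        """Record one more discovered link pointing at url."""
        if url in self.visited:
            return
        self.in_links[url] = self.in_links.get(url, 0) + 1
        # Push a fresh entry; older entries for this URL become stale.
        heapq.heappush(self.heap, (-self.in_links[url], self.counter, url))
        self.counter += 1

    def pop(self):
        """Return the best URL to crawl next, or None if nothing is left."""
        while self.heap:
            neg_count, _, url = heapq.heappop(self.heap)
            # Skip stale entries: already visited, or superseded by an
            # entry with a higher count.
            if url in self.visited or -neg_count != self.in_links[url]:
                continue
            self.visited.add(url)
            return url
        return None
```

Ties are broken by insertion order, so with no repeated links the frontier behaves like breadth-first search.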
29 Variations
- Save only images
  - Students interested in image processing
  - Save all images on a topic
- Save PDF files
  - Collect academic papers on a given topic
- Audio, video, etc.
30 Scalable web crawling
- Resources
  - CPU: network operations, parsing the page content
  - I/O: write to disk
  - Network bandwidth
  - Memory?
- Parallel web crawling
  - Use multiple CPUs
  - Use multiple storage devices
- Challenges
  - Synchronization
  - Network bandwidth bottleneck
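A small sketch of one way to sidestep the synchronization problem: fetch all pages of the current BFS level in parallel, then merge the discovered links in a single thread, so the shared `seen` set is never touched concurrently. The name `parallel_crawl` and the `fetch_links` callable (which returns the absolute URLs found on a page) are assumptions for illustration. Threads are used here because fetching is mostly waiting on the network; in CPython, CPU-heavy parsing would need processes to use multiple cores.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_crawl(seed, fetch_links, workers=4, max_depth=3):
    """Level-synchronous parallel BFS; returns URLs in visit order."""
    seen = {seed}
    frontier = [seed]
    order = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_depth + 1):
            if not frontier:
                break
            # Fetch every page of the current level concurrently.
            # pool.map returns results in the same order as frontier.
            results = list(pool.map(fetch_links, frontier))
            order.extend(frontier)
            next_frontier = []
            # Merge in the main thread only, so `seen` needs no lock.
            for links in results:
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return order


# Hypothetical link graph used instead of real pages.
GRAPH = {"s": ["a", "b"], "a": ["b", "c"], "b": ["c"], "c": []}
visited = parallel_crawl("s", lambda url: GRAPH.get(url, []))
```

The price of this design is a barrier at the end of every level: one slow page delays the whole level. A crawler with a shared, lock-protected frontier avoids the barrier but has to deal with the synchronization directly.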
31 Crawler etiquette
- Crawling is not always welcome
- Robot exclusion standard (robots.txt)
- Have a polite mode
  - Limit crawler bandwidth
  - Limit per site
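Python's standard library includes a robots.txt parser, `urllib.robotparser`, which makes a polite mode easy to sketch. In the example below, `ROBOTS_TXT` is a made-up file, `PolitenessGate` and the user agent `CourseBot` are hypothetical names, and a single rules object is used for all URLs; a real crawler would download and keep one robots.txt per host. `Crawl-delay` is a common but non-standard directive.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a crawler would download /robots.txt instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_text):
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


class PolitenessGate:
    """Checks robots.txt rules and spaces out requests per host."""

    def __init__(self, rules, user_agent, default_delay=1.0,
                 clock=time.monotonic):
        self.rules = rules
        self.user_agent = user_agent
        # Use the site's Crawl-delay if it gives one, else our own default.
        self.delay = rules.crawl_delay(user_agent) or default_delay
        self.clock = clock
        self.next_allowed = {}   # host -> earliest time of the next request

    def allowed(self, url):
        return self.rules.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Seconds to wait before requesting url; reserves the time slot."""
        host = urlsplit(url).netloc
        now = self.clock()
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.delay
        return start - now
```

Before each request the crawler would call `allowed(url)`, skip the URL if it returns `False`, and otherwise sleep for `wait_time(url)` seconds. Because the delay is tracked per host, a crawl spread over many sites stays fast while each individual site sees only a trickle of requests.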