Rolling the Dice on Big Data. Ilse Ipsen Department of Mathematics

Size: px
Start display at page:

Download "Rolling the Dice on Big Data. Ilse Ipsen Department of Mathematics"

Transcription

1 Rolling the Dice on Big Data Ilse Ipsen Department of Mathematics

2 The Economist, 27 February 2010

3 Science, 11 February 2011

4 McKinsey Global Institute, May 2011

5

6

7 Rolling the Dice on Big Data What is Big?

8 Measuring Units 1 byte 1 character 10 bytes 1 word 100 bytes 1 sentence 1 kilobyte = 1,000 bytes 1 page

9 1 byte 1 character 10 bytes 1 word 100 bytes 1 sentence Measuring Units 1 kilobyte = 1,000 bytes 1 page 1 megabyte = 1,000 kilobytes complete works of Shakespeare 1 gigabyte = 1,000 megabytes a big shelf full of books

10 1 byte 1 character 10 bytes 1 word 100 bytes 1 sentence Measuring Units 1 kilobyte = 1,000 bytes 1 page 1 megabyte = 1,000 kilobytes complete works of Shakespeare 1 gigabyte = 1,000 megabytes a big shelf full of books 1 terabyte = 1,000 gigabytes all books in the Library of Congress

11 1 byte 1 character 10 bytes 1 word 100 bytes 1 sentence Measuring Units 1 kilobyte = 1,000 bytes 1 page 1 megabyte = 1,000 kilobytes complete works of Shakespeare 1 gigabyte = 1,000 megabytes a big shelf full of books 1 terabyte = 1,000 gigabytes all books in the Library of Congress 1 petabyte = 1,000 terabytes 20 million 4-door filing cabinets full of text

12 1 byte 1 grain of sand

13 1 byte 1 grain of sand 1 terabyte number of grains to fill a swimming pool

14 Rolling the Dice on Big Data Data

15 Rolling the Dice on Big Data

16 Rolling the Dice on Big Data Not quite

17 The Data in this Talk Given: Database: Collection of documents (data points) Query: Single document (data point) Want: Documents closest to query A tiny example to illustrate a big data problem

18 A Tiny Data Example Database: s from known authors 1: shipment of gold damaged in a fire 2: delivery of silver arrived in a silver truck 3: shipment of gold arrived in a truck Query: from unknown author gold silver truck Which s match the query best? These s may give clues about the author of query Simplest approach for matching: Word frequency

19 Tabulating s and Query Database (term document matrix) + Query Terms Query a arrived damaged delivery fire gold in of silver shipment truck

20 Basic Approach for Finding Matching s 1 Common words For each Count number of words common to and Query

21 Basic Approach for Finding Matching s 1 Common words For each Count number of words common to and Query 2 Length Count number of words in each , and in Query

22 Basic Approach for Finding Matching s 1 Common words For each Count number of words common to and Query 2 Length Count number of words in each , and in Query 3 Matching score for each Matching score = Number of common words (Length of ) (Length of Query) s with highest matching scores: May give clues about authors of Query

23 Count Common Words in Query and 1 Terms E 1 Q Multiply a arrived damaged delivery fire gold in of silver shipment truck Sum 1 # common words in 1 and Query: E 1 Q = 1

24 Count Common Words in Query and 2 Terms E 2 Q Multiply a arrived damaged delivery fire gold in of silver shipment truck Sum 3 # common words in 2 and Query: E 2 Q = 3

25 Count Common Words in Query and 3 Terms E 3 Q Multiply a arrived damaged delivery fire gold in of silver shipment truck Sum 2 # common words in 3 and Query: E 3 Q = 2

26 Basic Approach for Finding Matching s 1 Number of words common to s and Query E 1 Q = 1 E 2 Q = 3 E 3 Q = 2 2 Length Count number of words in each , and in query

27 Length of Query Terms Q Square a 0 0 arrived 0 0 damaged 0 0 delivery 0 0 fire 0 0 gold 1 1 in 0 0 of 0 0 silver 1 1 shipment 0 0 truck 1 1 Sum 3 Length of Query: Q = 3 1.7

28 Length of 2 Terms E 2 Square a 1 1 arrived 1 1 damaged 0 0 delivery 1 1 fire 0 0 gold 0 0 in 1 1 of 1 1 silver 2 4 shipment 0 0 truck 1 1 Sum 10 Length of 2: E 2 =

29 Basic Approach for Finding Matching s 1 Number of words common to s and Query E 1 Q = 1 E 2 Q = 3 E 3 Q = 2 2 Length of s and Query Q = E 1 = E 2 = E 3 = 7 2.6

30 Matching Score for each Matching score = Number of common words (Length of ) (Length of query) E 1 Q E 1 Q = E 2 Q E 2 Q = E 3 Q E 3 Q = is the best match for the query

31 Conclusion for Tiny Data Example Database: s from known authors 1: shipment of gold damaged in a fire 2: delivery of silver arrived in a silver truck 3: shipment of gold arrived in a truck Query: from unknown author gold silver truck Best matching 2: delivery of silver arrived in a silver truck

32 The Reason for the Weird Way of Counting Vector Space Model s, Query = vectors Matching score = cosine of angle between and Query E Q = cos (E, Q) E Q

33 What this means in practice Average number of s per day: 294 billion Number words in English language: at least 250,000 Matching one query with a single 250,000 operations (one for every possible word) Matching one query with all s: 250,000 * 294 billion = operations

34 What this means in practice Average number of s per day: 294 billion Number words in English language: at least 250,000 Matching one query with a single 250,000 operations (one for every possible word) Matching one query with all s: 250,000 * 294 billion = operations Fast PC (Intel Core i7 980 XE) 109 Gflops = floating point operations per second Matching one query with all s: about 8 days

35 What this means in practice Average number of s per day: 294 billion Number words in English language: at least 250,000 Matching one query with a single 250,000 operations (one for every possible word) Matching one query with all s: 250,000 * 294 billion = operations Fast PC (Intel Core i7 980 XE) 109 Gflops = floating point operations per second Matching one query with all s: about 8 days US supercomputer (Cray XT5, Opteron quad core 2.3GHz) Peak 1,381,400 Gflops Matching one query with all s: about 1 minute

36 Can the Matching be Performed Faster?

37 Can the Matching be Performed Faster? Yes! Ralph Abbey, Sarah Warkentin, Sylvester Eriksson-Bique, Mary Solbrig, Michael Stefanelli

38 Rolling the Dice on Big Data Rolling the Dice

39 Rolling the Dice on Big Data Rolling the Dice on which words to use for the matching

40 Idea Randomized Query Matching Algorithm Do not use every word in query and s Monte Carlo Sampling: Use only selected words {Downsize to smaller database with fewer words}

41 Idea Randomized Query Matching Algorithm Do not use every word in query and s Monte Carlo Sampling: Use only selected words {Downsize to smaller database with fewer words} Justification Don t need exact matching scores Identify only s with highest matching scores Database available for offline computation Derive statistics based on word frequencies Perform query matching online Use statistics to select words used for matching

42 Suggestions for Downsizing the Database Statistics n: number of words in database Q j : frequency of word j in query W j : frequency of word j in database Suggestion for selecting word j Probability of sampling word j W j Q j p j = W 1 Q W n Q n Frequently occurring words more likely to be sampled

43 Rolling the Dice = Downsizing the Database User input s: number of samples {number of words in downsized database} Monte Carlo Sampling {Roll the dice s times} For t = 1,..., s Sample index j t from {1,..., n} with probability p jt independently and with replacement Downsized database contains only s words: word j 1, word j 2,..., word j s

44 Matching with Downsized Database Downsized database: word j 1, word j 2,..., word j s Word frequency in Query: ˆQ = ( Q j1 Q j2... Q js ) For each E:

45 Matching with Downsized Database Downsized database: word j 1, word j 2,..., word j s Word frequency in Query: ˆQ = ( Q j1 Q j2... Q js ) For each E: Word frequency Ê = ( F j1 F j2... F js )

46 Matching with Downsized Database Downsized database: word j 1, word j 2,..., word j s Word frequency in Query: ˆQ = ( Q j1 Q j2... Q js ) For each E: Word frequency Ê = ( F j1 F j2... F js ) Approximate number of words common to and Query C = 1 ( Fj1 Q j1 + F j 2 Q j2 + + F ) j s Q js s p j1 p j2 p js {s, p j1, p j2,..., p js compensate for fewer words}

47 Matching with Downsized Database Downsized database: word j 1, word j 2,..., word j s Word frequency in Query: ˆQ = ( Q j1 Q j2... Q js ) For each E: Word frequency Ê = ( F j1 F j2... F js ) Approximate number of words common to and Query C = 1 ( Fj1 Q j1 + F j 2 Q j2 + + F ) j s Q js s p j1 p j2 p js {s, p j1, p j2,..., p js compensate for fewer words} Approximate matching score of C Ê ˆQ

48 Reuters Collection: Transcribed Subset 201 documents and 5601 words Number of sampled words s = 56 1 percent Ranking Deterministic Uniform Deterministic q Bucket of computed 25 best matches contains Correct 10 best matches in 99% of all cases

49 Wikipedia Dataset 200 documents and 198,853 words Average percent of correct 10 best matches as function of sample size % correct rankings Number of samples, c Sampling 1% of the words gives correct 9 best matches. More sampling does not help a lot.

50 Big data Summary Matching queries against document database Rolling the dice But... Randomized downsizing of database vocabulary Frequently occurring words more likely to be kept

51 Big data Summary Matching queries against document database Rolling the dice Randomized downsizing of database vocabulary Frequently occurring words more likely to be kept But... Why not use a predictable (deterministic) algorithm? Why use a randomized algorithm?

52 Big data Summary Matching queries against document database Rolling the dice Randomized downsizing of database vocabulary Frequently occurring words more likely to be kept But... Why not use a predictable (deterministic) algorithm? Why use a randomized algorithm? Advantages of randomized algorithm Easy to analyze Fast, and simple to implement As good in practice as deterministic algorithm (for this type of application)

53 The Bigger Picture Many different methods for fast query matching Algorithm in this talk: Randomized matrix vector multiplication Other randomized matrix algorithms: Matrix multiplication Subset selection Least squares problems (regression) Low rank approximation (PCA) Applications for randomized algorithms: Social network analysis, population genetics, circuit testing,...

54 NSF Web National Science Foundation, 29 March 2012 Press Release NSF Leads Federal Efforts In Big Data At White House event, NSF Director announces new Big Data solicitation, $10 million Expeditions in Computing award, and awards in cyberinfrastructure, geosciences, training UC Irvine's HIPerW advances earth sc and visualization Credit and Larger Hurricane Ike visualization created by Texas Advanced Computing Center (TACC) supercomputer Ranger. Credit and Larger Version March 29, 2012

Using Patterns of Integer Exponents

Using Patterns of Integer Exponents 8.EE.1 Know and apply the properties of integer exponents to generate equivalent numerical expressions. How can you develop and use the properties of integer exponents? The table below shows powers of

More information

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering

More information

Computer Logic (2.2.3)

Computer Logic (2.2.3) Computer Logic (2.2.3) Distinction between analogue and discrete processes and quantities. Conversion of analogue quantities to digital form. Using sampling techniques, use of 2-state electronic devices

More information

BIG DATA, BIOBANKS AND PREDICTIVE ANALYTICS FOR A BETTER CLINICAL OUTCOME

BIG DATA, BIOBANKS AND PREDICTIVE ANALYTICS FOR A BETTER CLINICAL OUTCOME BIG DATA, BIOBANKS AND PREDICTIVE ANALYTICS FOR A BETTER CLINICAL OUTCOME Π. Ε. Βάρδας MD, PhD(London) DISCLOSURES My great love to innovative ideas BIG DATA It is a broad term for data sets, so large

More information

Building a Top500-class Supercomputing Cluster at LNS-BUAP

Building a Top500-class Supercomputing Cluster at LNS-BUAP Building a Top500-class Supercomputing Cluster at LNS-BUAP Dr. José Luis Ricardo Chávez Dr. Humberto Salazar Ibargüen Dr. Enrique Varela Carlos Laboratorio Nacional de Supercómputo Benemérita Universidad

More information

lesson 1 An Overview of the Computer System

lesson 1 An Overview of the Computer System essential concepts lesson 1 An Overview of the Computer System This lesson includes the following sections: The Computer System Defined Hardware: The Nuts and Bolts of the Machine Software: Bringing the

More information

Availability Digest. www.availabilitydigest.com. Data Deduplication February 2011

Availability Digest. www.availabilitydigest.com. Data Deduplication February 2011 the Availability Digest Data Deduplication February 2011 What is Data Deduplication? Data deduplication is a technology that can reduce disk storage-capacity requirements and replication bandwidth requirements

More information

The Procedures of Monte Carlo Simulation (and Resampling)

The Procedures of Monte Carlo Simulation (and Resampling) 154 Resampling: The New Statistics CHAPTER 10 The Procedures of Monte Carlo Simulation (and Resampling) A Definition and General Procedure for Monte Carlo Simulation Summary Until now, the steps to follow

More information

Big Data Analytics. Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs

Big Data Analytics. Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs 1 Big Data Analytics Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs Montevideo, 22 nd November 4 th December, 2015 INFORMATIQUE

More information

An Introduction to Applied Mathematics: An Iterative Process

An Introduction to Applied Mathematics: An Iterative Process An Introduction to Applied Mathematics: An Iterative Process Applied mathematics seeks to make predictions about some topic such as weather prediction, future value of an investment, the speed of a falling

More information

PACE Predictive Analytics Center of Excellence @ San Diego Supercomputer Center, UCSD. Natasha Balac, Ph.D.

PACE Predictive Analytics Center of Excellence @ San Diego Supercomputer Center, UCSD. Natasha Balac, Ph.D. PACE Predictive Analytics Center of Excellence @ San Diego Supercomputer Center, UCSD Natasha Balac, Ph.D. Brief History of SDSC 1985-1997: NSF national supercomputer center; managed by General Atomics

More information

COMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1)

COMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1) COMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1) Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University

More information

Mental Questions. Day 1. 1. What number is five cubed? 2. A circle has radius r. What is the formula for the area of the circle?

Mental Questions. Day 1. 1. What number is five cubed? 2. A circle has radius r. What is the formula for the area of the circle? Mental Questions 1. What number is five cubed? KS3 MATHEMATICS 10 4 10 Level 8 Questions Day 1 2. A circle has radius r. What is the formula for the area of the circle? 3. Jenny and Mark share some money

More information

Supervised Learning Evaluation (via Sentiment Analysis)!

Supervised Learning Evaluation (via Sentiment Analysis)! Supervised Learning Evaluation (via Sentiment Analysis)! Why Analyze Sentiment? Sentiment Analysis (Opinion Mining) Automatically label documents with their sentiment Toward a topic Aggregated over documents

More information

The Mysterious Cloud What s In It For Propane? Aaron Cargas acargas@cargas.com CargasEnergy.com Booth: 1339

The Mysterious Cloud What s In It For Propane? Aaron Cargas acargas@cargas.com CargasEnergy.com Booth: 1339 The Mysterious Cloud What s In It For Propane? Aaron Cargas acargas@cargas.com CargasEnergy.com Booth: 1339 Introduction Aaron Cargas VP of Marketing and Product Development at Cargas Systems Our Company

More information

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit. Objectives The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Identify the components of the central processing unit and how they work together and interact with memory Describe how

More information

Recommending News Articles using Cosine Similarity Function Rajendra LVN 1, Qing Wang 2 and John Dilip Raj 1

Recommending News Articles using Cosine Similarity Function Rajendra LVN 1, Qing Wang 2 and John Dilip Raj 1 Paper 1886-2014 Recommending News s using Cosine Similarity Function Rajendra LVN 1, Qing Wang 2 and John Dilip Raj 1 1 GE Capital Retail Finance, 2 Warwick Business School ABSTRACT Predicting news articles

More information

HOW TO BECOME AN ESI HERO

HOW TO BECOME AN ESI HERO HOW TO BECOME AN ESI HERO taking the mystery out of ediscovery www.fxhnd.com info@fxhnd.com Electronically Stored Information Boo! But why do I have to learn about all this technology? It s how we communicate

More information

Using Ultra-Large Data Sets in Healthcare New Questions-New Answers

Using Ultra-Large Data Sets in Healthcare New Questions-New Answers Using Ultra-Large Data Sets in Healthcare New Questions-New Answers David Hartzband, D.Sc.. Director, Technology Research, RCHN Community Health Foundation & Lecturer, Engineering Systems Division Massachusetts

More information

MATHS ACTIVITIES FOR REGISTRATION TIME

MATHS ACTIVITIES FOR REGISTRATION TIME MATHS ACTIVITIES FOR REGISTRATION TIME At the beginning of the year, pair children as partners. You could match different ability children for support. Target Number Write a target number on the board.

More information

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28 Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object

More information

An open source software project that enables the distributed processing of very large data sets across multiple servers Basically:

An open source software project that enables the distributed processing of very large data sets across multiple servers Basically: Big Data, Fine Print Business and Legal Considerations for Companies Dealing in Data Rachel Tarko Hudson, Technology Transactions Team & Privacy Team Sheppard Mullin Richter & Hampton LLP 2015 What is

More information

Applications for Business Intelligence, Predictive Analytics and Big Data

Applications for Business Intelligence, Predictive Analytics and Big Data Finance, Management, & Operations Applications for Business Intelligence, Predictive Analytics and Big Data Patrick Bogan, Chief Information Officer, Fuzion Analytics Kyle Korzenowski, Chief Information

More information

ELECTRONIC DOCUMENT IMAGING

ELECTRONIC DOCUMENT IMAGING AIIM: Association for Information and Image Management. Trade association and professional society for the micrographics, optical disk and electronic image management markets. Algorithm: Prescribed set

More information

CHAPTER 2: HARDWARE BASICS: INSIDE THE BOX

CHAPTER 2: HARDWARE BASICS: INSIDE THE BOX CHAPTER 2: HARDWARE BASICS: INSIDE THE BOX Multiple Choice: 1. Processing information involves: A. accepting information from the outside world. B. communication with another computer. C. performing arithmetic

More information

Data Visualization for Atomistic/Molecular Simulations. Douglas E. Spearot University of Arkansas

Data Visualization for Atomistic/Molecular Simulations. Douglas E. Spearot University of Arkansas Data Visualization for Atomistic/Molecular Simulations Douglas E. Spearot University of Arkansas What is Atomistic Simulation? Molecular dynamics (MD) involves the explicit simulation of atomic scale particles

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first

More information

Mammoth Scale Machine Learning!

Mammoth Scale Machine Learning! Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes

More information

Determining Your Computer Resources

Determining Your Computer Resources Determining Your Computer Resources There are a number of computer components that must meet certain requirements in order for your computer to perform effectively. This document explains how to check

More information

Why is Internal Audit so Hard?

Why is Internal Audit so Hard? Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets

More information

Maths Targets for pupils in Year 2

Maths Targets for pupils in Year 2 Maths Targets for pupils in Year 2 A booklet for parents Help your child with mathematics For additional information on the agreed calculation methods, please see the school website. ABOUT THE TARGETS

More information

1 o Semestre 2007/2008

1 o Semestre 2007/2008 Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction

More information

Student Name GRADE. Mississippi Curriculum Test, Second Edition PRACTICE TEST BOOK MATHEMATICS

Student Name GRADE. Mississippi Curriculum Test, Second Edition PRACTICE TEST BOOK MATHEMATICS Student Name Mississippi Curriculum Test, Second Edition MCT2 GRADE 3 PRACTICE TEST BOOK MATHEMATICS Practice Test 3 for MCT2 is developed and under contract with the Mississippi Department of Education

More information

Flash Drives and File Management (Windows 7 and 8)

Flash Drives and File Management (Windows 7 and 8) Better Technology, Onsite and Personal Connecting NIOGA s Communities www.btopexpress.org www.nioga.org [Type Flash Drives and File Management (Windows 7 and 8) Overview: This class focuses on saving,

More information

Evidence to Action: Use of Predictive Models for Beach Water Postings

Evidence to Action: Use of Predictive Models for Beach Water Postings Evidence to Action: Use of Predictive Models for Beach Water Postings Canadian Society for Epidemiology and Biostatistics Caitlyn Paget, June 4 th 2015 Goal is to improve program delivery Can we improve

More information

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems 202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric

More information

7 th Grade Math Foundations for Teaching Unit One: Numbers & Operations Module One: Rational Number s

7 th Grade Math Foundations for Teaching Unit One: Numbers & Operations Module One: Rational Number s Unit One: Numbers & Operations Module One: s Day/Date TEKS Activities and Resources Essential Question Aug 25 Aug 26 Aug 27 Aug 28 Aug 29 (SS) (SS) (SS) (SS) - First Day Procedures - Math Class Student

More information

The Central Processing Unit:

The Central Processing Unit: The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Objectives Identify the components of the central processing unit and how they work together and interact with memory Describe how

More information

Challenges in e-science: Research in a Digital World

Challenges in e-science: Research in a Digital World Challenges in e-science: Research in a Digital World Thom Dunning National Center for Supercomputing Applications National Center for Supercomputing Applications University of Illinois at Urbana-Champaign

More information

Probabilities. Probability of a event. From Random Variables to Events. From Random Variables to Events. Probability Theory I

Probabilities. Probability of a event. From Random Variables to Events. From Random Variables to Events. Probability Theory I Victor Adamchi Danny Sleator Great Theoretical Ideas In Computer Science Probability Theory I CS 5-25 Spring 200 Lecture Feb. 6, 200 Carnegie Mellon University We will consider chance experiments with

More information

Doing Multidisciplinary Research in Data Science

Doing Multidisciplinary Research in Data Science Doing Multidisciplinary Research in Data Science Assoc.Prof. Abzetdin ADAMOV CeDAWI - Center for Data Analytics and Web Insights Qafqaz University aadamov@qu.edu.az http://ce.qu.edu.az/~aadamov 16 May

More information

SOLVING LINEAR SYSTEMS

SOLVING LINEAR SYSTEMS SOLVING LINEAR SYSTEMS Linear systems Ax = b occur widely in applied mathematics They occur as direct formulations of real world problems; but more often, they occur as a part of the numerical analysis

More information

Solving Systems of Equations Introduction

Solving Systems of Equations Introduction Solving Systems of Equations Introduction Outcome (learning objective) Students will write simple systems of equations and become familiar with systems of equations vocabulary terms. Student/Class Goal

More information

Mass Storage Structure

Mass Storage Structure Mass Storage Structure 12 CHAPTER Practice Exercises 12.1 The accelerating seek described in Exercise 12.3 is typical of hard-disk drives. By contrast, floppy disks (and many hard disks manufactured before

More information

Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

More information

Calculating VaR. Capital Market Risk Advisors CMRA

Calculating VaR. Capital Market Risk Advisors CMRA Calculating VaR Capital Market Risk Advisors How is VAR Calculated? Sensitivity Estimate Models - use sensitivity factors such as duration to estimate the change in value of the portfolio to changes in

More information

Data-Intensive Science and Scientific Data Infrastructure

Data-Intensive Science and Scientific Data Infrastructure Data-Intensive Science and Scientific Data Infrastructure Russ Rew, UCAR Unidata ICTP Advanced School on High Performance and Grid Computing 13 April 2011 Overview Data-intensive science Publishing scientific

More information

Solving Systems of Linear Equations Putting it All Together

Solving Systems of Linear Equations Putting it All Together Solving Systems of Linear Equations Putting it All Together Outcome (lesson objective) Students will determine the best method to use when solving systems of equation as they solve problems using graphing,

More information

Queuing Theory. Long Term Averages. Assumptions. Interesting Values. Queuing Model

Queuing Theory. Long Term Averages. Assumptions. Interesting Values. Queuing Model Queuing Theory Queuing Theory Queuing theory is the mathematics of waiting lines. It is extremely useful in predicting and evaluating system performance. Queuing theory has been used for operations research.

More information

What is Big Data? The three(or four) Vs in Big Data In 2013 the total amount of stored information is estimated to be Volume.

What is Big Data? The three(or four) Vs in Big Data In 2013 the total amount of stored information is estimated to be Volume. 8/26/2014 CS581 Big Data - Fall 2014 1 8/26/2014 CS581 Big Data - Fall 2014 2 CS535/CS581A BIG DATA What is Big Data? PART 0. INTRODUCTION 1. INTRODUCTION TO BIG DATA 2. COURSE INTRODUCTION PART 0. INTRODUCTION

More information

A Catalogue of the Steiner Triple Systems of Order 19

A Catalogue of the Steiner Triple Systems of Order 19 A Catalogue of the Steiner Triple Systems of Order 19 Petteri Kaski 1, Patric R. J. Östergård 2, Olli Pottonen 2, and Lasse Kiviluoto 3 1 Helsinki Institute for Information Technology HIIT University of

More information

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science Dr. Daisy Zhe Wang CISE Department University of Florida August 25th 2014 20 Review Overview of Data Science Why Data

More information

FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL

FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL STATIsTICs 4 IV. RANDOm VECTORs 1. JOINTLY DIsTRIBUTED RANDOm VARIABLEs If are two rom variables defined on the same sample space we define the joint

More information

Definition of Computers. INTRODUCTION to COMPUTERS. Historical Development ENIAC

Definition of Computers. INTRODUCTION to COMPUTERS. Historical Development ENIAC Definition of Computers INTRODUCTION to COMPUTERS Bülent Ecevit University Department of Environmental Engineering A general-purpose machine that processes data according to a set of instructions that

More information

To convert an arbitrary power of 2 into its English equivalent, remember the rules of exponential arithmetic:

To convert an arbitrary power of 2 into its English equivalent, remember the rules of exponential arithmetic: Binary Numbers In computer science we deal almost exclusively with binary numbers. it will be very helpful to memorize some binary constants and their decimal and English equivalents. By English equivalents

More information

Congrats to Game Winners. How can computation use data to solve problems? What topics have we covered in CS 202? Part 1: Completed!

Congrats to Game Winners. How can computation use data to solve problems? What topics have we covered in CS 202? Part 1: Completed! CS 202: Introduction to Computation " UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department Professor Andrea Arpaci-Dusseau How can computation use data to solve problems? Congrats to Game Winners

More information

Math Journal HMH Mega Math. itools Number

Math Journal HMH Mega Math. itools Number Lesson 1.1 Algebra Number Patterns CC.3.OA.9 Identify arithmetic patterns (including patterns in the addition table or multiplication table), and explain them using properties of operations. Identify and

More information

Introduction to Fractions, Equivalent and Simplifying (1-2 days)

Introduction to Fractions, Equivalent and Simplifying (1-2 days) Introduction to Fractions, Equivalent and Simplifying (1-2 days) 1. Fraction 2. Numerator 3. Denominator 4. Equivalent 5. Simplest form Real World Examples: 1. Fractions in general, why and where we use

More information

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011 Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

More information

Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances

Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances It is possible to construct a matrix X of Cartesian coordinates of points in Euclidean space when we know the Euclidean

More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS

IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS Volume 2, No. 3, March 2011 Journal of Global Research in Computer Science RESEARCH PAPER Available Online at www.jgrcs.info IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE

More information

Introduction to the Mathematics of Big Data. Philippe B. Laval

Introduction to the Mathematics of Big Data. Philippe B. Laval Introduction to the Mathematics of Big Data Philippe B. Laval Fall 2015 Introduction In recent years, Big Data has become more than just a buzz word. Every major field of science, engineering, business,

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Big Data in OpenTopography

Big Data in OpenTopography Big Data in OpenTopography Vishu Nandigam San Diego Supercomputer Center NSF Big Data in Educa

More information

Promises and Pitfalls of Big-Data-Predictive Analytics: Best Practices and Trends

Promises and Pitfalls of Big-Data-Predictive Analytics: Best Practices and Trends Promises and Pitfalls of Big-Data-Predictive Analytics: Best Practices and Trends Spring 2015 Thomas Hill, Ph.D. VP Analytic Solutions Dell Statistica Overview and Agenda Dell Software overview Dell in

More information

Search engine ranking

Search engine ranking Proceedings of the 7 th International Conference on Applied Informatics Eger, Hungary, January 28 31, 2007. Vol. 2. pp. 417 422. Search engine ranking Mária Princz Faculty of Technical Engineering, University

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

Database Fundamentals

Database Fundamentals Database Fundamentals Computer Science 105 Boston University David G. Sullivan, Ph.D. Bit = 0 or 1 Measuring Data: Bits and Bytes One byte is 8 bits. example: 01101100 Other common units: name approximate

More information

Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics

Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics INTERNATIONAL BLACK SEA UNIVERSITY COMPUTER TECHNOLOGIES AND ENGINEERING FACULTY ELABORATION OF AN ALGORITHM OF DETECTING TESTS DIMENSIONALITY Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree

More information

How to report the percentage of explained common variance in exploratory factor analysis

How to report the percentage of explained common variance in exploratory factor analysis UNIVERSITAT ROVIRA I VIRGILI How to report the percentage of explained common variance in exploratory factor analysis Tarragona 2013 Please reference this document as: Lorenzo-Seva, U. (2013). How to report

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

RECORDS & INFORMATION MANAGEMENT

RECORDS & INFORMATION MANAGEMENT RECORDS & INFORMATION MANAGEMENT GOVERNANCE & POLICY DEVELOPMENT, WORKING WITH STAKEHOLDERS AND POLICY IMPLEMENTATION PRESENTED BY LAUREN BARNES, CRM SPONSORS: THE ARCHIVISTS ROUND TABLE NEW YORK & ARMA

More information

What Is Singapore Math?

What Is Singapore Math? What Is Singapore Math? You may be wondering what Singapore Math is all about, and with good reason. This is a totally new kind of math for you and your child. What you may not know is that Singapore has

More information

Parallelism and Cloud Computing

Parallelism and Cloud Computing Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication

More information

Beyond "Big data": Introducing the EOI framework for analytics teams to drive business impact

Beyond Big data: Introducing the EOI framework for analytics teams to drive business impact Beyond "Big data": Introducing the EOI framework for analytics teams to drive business impact Michael Li Business Analytics, LinkedIn Nov 20, 2014 J onathan Wu raveen Neppalli Naga P Chi-Yi Kuan Business

More information

PURSUITS IN MATHEMATICS often produce elementary functions as solutions that need to be

PURSUITS IN MATHEMATICS often produce elementary functions as solutions that need to be Fast Approximation of the Tangent, Hyperbolic Tangent, Exponential and Logarithmic Functions 2007 Ron Doerfler http://www.myreckonings.com June 27, 2007 Abstract There are some of us who enjoy using our

More information

NF5-12 Flexibility with Equivalent Fractions and Pages 110 112

NF5-12 Flexibility with Equivalent Fractions and Pages 110 112 NF5- Flexibility with Equivalent Fractions and Pages 0 Lowest Terms STANDARDS preparation for 5.NF.A., 5.NF.A. Goals Students will equivalent fractions using division and reduce fractions to lowest terms.

More information

CSCA0102 IT & Business Applications. Foundation in Business Information Technology School of Engineering & Computing Sciences FTMS College Global

CSCA0102 IT & Business Applications. Foundation in Business Information Technology School of Engineering & Computing Sciences FTMS College Global CSCA0102 IT & Business Applications Foundation in Business Information Technology School of Engineering & Computing Sciences FTMS College Global Chapter 2 Data Storage Concepts System Unit The system unit

More information

Final Project Report

Final Project Report CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

More information

Chapter 3: Computer Hardware Components: CPU, Memory, and I/O

Chapter 3: Computer Hardware Components: CPU, Memory, and I/O Chapter 3: Computer Hardware Components: CPU, Memory, and I/O What is the typical configuration of a computer sold today? The Computer Continuum 1-1 Computer Hardware Components In this chapter: How did

More information

Optimization of Preventive Maintenance Scheduling in Processing Plants

Optimization of Preventive Maintenance Scheduling in Processing Plants 18 th European Symposium on Computer Aided Process Engineering ESCAPE 18 Bertrand Braunschweig and Xavier Joulia (Editors) 2008 Elsevier B.V./Ltd. All rights reserved. Optimization of Preventive Maintenance

More information

Prime Time: Homework Examples from ACE

Prime Time: Homework Examples from ACE Prime Time: Homework Examples from ACE Investigation 1: Building on Factors and Multiples, ACE #8, 28 Investigation 2: Common Multiples and Common Factors, ACE #11, 16, 17, 28 Investigation 3: Factorizations:

More information

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,

More information

THE ROSEN MARKET TIMING LETTER

THE ROSEN MARKET TIMING LETTER THE ROSEN MARKET TIMING LETTER PRECIOUS METALS - FOREX - STOCK INDICES - COMMODITIES www.deltasociety.com/product/rosen-market-timing-letter RONALD L. ROSEN June 10, 2016 REPORT AND REVIEW GOLD AND SILVER

More information

Stat 20: Intro to Probability and Statistics

Stat 20: Intro to Probability and Statistics Stat 20: Intro to Probability and Statistics Lecture 16: More Box Models Tessa L. Childers-Day UC Berkeley 22 July 2014 By the end of this lecture... You will be able to: Determine what we expect the sum

More information

Introduction To Computers: Hardware and Software

Introduction To Computers: Hardware and Software What Is Hardware? Introduction To Computers: Hardware and Software A computer is made up of hardware. Hardware is the physical components of a computer system e.g., a monitor, keyboard, mouse and the computer

More information

Chapter 4 System Unit Components. Discovering Computers 2012. Your Interactive Guide to the Digital World

Chapter 4 System Unit Components. Discovering Computers 2012. Your Interactive Guide to the Digital World Chapter 4 System Unit Components Discovering Computers 2012 Your Interactive Guide to the Digital World Objectives Overview Differentiate among various styles of system units on desktop computers, notebook

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

More information

The following are general terms that we have found being used by tenants, landlords, IT Staff and consultants when discussing facility space.

The following are general terms that we have found being used by tenants, landlords, IT Staff and consultants when discussing facility space. The following are general terms that we have found being used by tenants, landlords, IT Staff and consultants when discussing facility space. Terminology: Telco: Dmarc: NOC: SAN: GENSET: Switch: Blade

More information

Day 1. Mental Arithmetic Questions. 1. What number is five cubed? 2. A circle has radius r. What is the formula for the area of the circle?

Day 1. Mental Arithmetic Questions. 1. What number is five cubed? 2. A circle has radius r. What is the formula for the area of the circle? Mental Arithmetic Questions 1. What number is five cubed? KS3 MATHEMATICS 10 4 10 Level 6 Questions Day 1 2. A circle has radius r. What is the formula for the area of the circle? 3. Jenny and Mark share

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information

Text Analytics using High Performance SAS Text Miner

Text Analytics using High Performance SAS Text Miner Text Analytics using High Performance SAS Text Miner Edward R. Jones, Ph.D. Exec. Vice Pres.; Texas A&M Statistical Services Abstract: The latest release of SAS Enterprise Miner, version 13.1, contains

More information

The 5 P s in Problem Solving *prob lem: a source of perplexity, distress, or vexation. *solve: to find a solution, explanation, or answer for

The 5 P s in Problem Solving *prob lem: a source of perplexity, distress, or vexation. *solve: to find a solution, explanation, or answer for The 5 P s in Problem Solving 1 How do other people solve problems? The 5 P s in Problem Solving *prob lem: a source of perplexity, distress, or vexation *solve: to find a solution, explanation, or answer

More information

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015.

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015. Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment -3, Probability and Statistics, March 05. Due:-March 5, 05.. Show that the function 0 for x < x+ F (x) = 4 for x < for x

More information

IP Video Rendering Basics

IP Video Rendering Basics CohuHD offers a broad line of High Definition network based cameras, positioning systems and VMS solutions designed for the performance requirements associated with critical infrastructure applications.

More information

Main Memory & Backing Store. Main memory backing storage devices

Main Memory & Backing Store. Main memory backing storage devices Main Memory & Backing Store Main memory backing storage devices 1 Introduction computers store programs & data in two different ways: nmain memory ntemporarily stores programs & data that are being processed

More information