Algorithmic Techniques for Big Data Analysis. Barna Saha, AT&T Labs-Research
2 Challenges of Big Data
VOLUME: large amount of data
VELOCITY: needs to be analyzed quickly
VARIETY: different types of structured and unstructured data
VERACITY: low-quality data and inconsistencies; need to ensure data quality
3 This Course
Develop algorithms to deal with such data, with emphasis on different models for processing data and on common techniques and fundamental paradigms.
Style: algorithmic/theoretical; background in basic algorithms and probability assumed.
However: important problems and (mostly) practical solutions. Applied projects are OK and encouraged (more discussion to follow).
4 Grading (no exams)
Scribing once: 20%
Participation (class and blog): 20%. All information and lecture notes will be posted on the blog; add references, add datasets, write posts on interesting articles, discuss questions from the lecture.
Project: 60%
- Survey: 30%. Read ~5 papers (~2 for background, ~3 on recent developments); out-of-class presentation/discussion.
- Project presentation + write-up: 30%.
Goal: a project worthy of publication in a good conference in theory/databases/data mining.
5 Timeline
Scribe notes: due before the next class; a template will be provided.
Oct 3rd: submit the names of 5 survey papers (you are encouraged to consult with me beforehand).
Oct 17th: project proposal due. Max 2 pages + references; must be in LaTeX, single column, article style, 11 pt.
Nov 14th: a one-page write-up describing progress on the project due.
Dec 5th: final project write-up due. At most 10 pages in LaTeX + references + appendix (if needed for omitted proofs/experiments); introduction with motivation + related work + technical part + experimental section.
6 Office Hours
By appointment: Friday 1pm to 5pm. Send an email to fix a time: [email protected], [email protected]. Where: KHKH.
You are encouraged to discuss progress on your project with me throughout the semester.
7 Tentative Syllabus
Models: small-memory algorithms; external-memory algorithms; distributed algorithms (MapReduce); crowdsourcing (people-assisted computing).
Focus: efficiency vs. quality. Near-linear-time algorithm design; incremental and update-efficient algorithm design; sub-linear algorithm design / property testing; sparse transformation; dimensionality reduction; metric embedding.
8 Tentative Course Plan
Sept 5th: Overview, Introduction to Data Streaming
Sept 12th to Oct 3rd: Semi-streaming and External Memory Algorithms
Oct 10th to Oct 24th: Property Testing; Streaming: Sketching + Sampling, Dimensionality Reduction; MapReduce Algorithms
Oct 31st: Sparse Transformation or Near-Linear Time Algorithm Design
Nov 7th: Metric Embedding
Nov 14th: Crowdsourcing
Nov 21st: Project Presentation
Nov 28th: Thanksgiving break
Dec 5th: Project Presentation
9 Plan for this lecture
1st half: overview of the course topics. 2nd half: introduction to data streaming.
10 Models
Different models need different algorithms for the same problem. Default: main memory model; external memory model; streaming model; MapReduce; crowdsourcing.
1. Do you have enough main memory?
2. How much disk I/O are you performing?
3. Is your data changing fast?
4. Can you distribute your data to multiple servers for fast processing?
5. Is your data so ambiguous that it needs human power to process?
11 Counting Distinct Elements
Given a sequence A = a_1, a_2, ..., a_m where a_i ∈ {1, ..., n}, compute the number of distinct elements in A (denoted |A|).
A natural and popular statistic, e.g.: given a list of transactions, compute the number of different customers (i.e., credit card numbers); what is the size of the web vocabulary?
Example: distinct elements = 7
12 Counting Distinct Elements
Default model: random access main memory model.
Maintain an array B[1..n], initially all 0. If item i arrives, set B[i] = 1. Count the number of 1s in B.
13 O(m) running time, but: requires random access to B, and requires space n even though the number of distinct elements may be small, m may be less than n, or the domain may be much larger.
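The bitmap algorithm above can be sketched as follows (a minimal illustration, assuming the domain {1, ..., n} is known in advance):

```python
# Sketch of the main-memory bitmap algorithm: one bit per domain element.
def distinct_bitmap(stream, n):
    B = [0] * (n + 1)          # B[1..n], initially all 0
    for a in stream:
        B[a] = 1               # item a arrived: mark it as seen
    return sum(B)              # number of 1s in B = number of distinct items

print(distinct_bitmap([3, 1, 3, 7, 1, 2], n=10))  # 4
```

Note that the space is Θ(n) regardless of how few distinct items the stream actually contains, which is exactly the drawback the slide points out.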
14 Counting Distinct Elements
Default model: random access memory model.
Initialize count = 0, an array of lists B[1..O(m)], and a hash function h: {1, ..., n} → {1, ..., O(m)}.
For each a_i: compute j = h(a_i); check if a_i occurs in the list pointed to by B[j]; if not, count = count + 1 and add a_i to the list. Return count.
Assuming that h(.) is random enough, the running time is O(m) and the space usage O(m). PROVE IT!
Space is still O(m), and B is randomly accessed for each input item.
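The hashing variant above can be sketched as below; the multiplicative hash used here is a toy stand-in for the "random enough" h on the slide:

```python
import random

# Sketch of the chained-hashing algorithm: buckets B[1..O(m)], each a list.
def distinct_hashing(stream):
    m = len(stream)
    buckets = [[] for _ in range(2 * m)]      # B[1..O(m)] with O(m) = 2m
    r = random.randrange(1, 1 << 31)
    h = lambda x: (x * r) % (2 * m)           # illustrative hash function
    count = 0
    for a in stream:
        chain = buckets[h(a)]
        if a not in chain:                    # a_i not yet in its list
            count += 1
            chain.append(a)
    return count

print(distinct_hashing([5, 5, 9, 1, 9, 9, 2]))  # 4
```

The answer is exact whatever h does; the hash quality only affects the expected chain lengths and hence the O(m) running time.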
15 Counting Distinct Elements: External Memory Model
M units of main memory; input size m, m >> M. Data is stored on disk, divided into blocks of size B <= M. Transferring one block of data into main memory takes unit time. Main memory operations are free but disk I/O is costly; the goal is to minimize the number of disk I/Os.
16 Distinct Elements in External Memory
Sorting in external memory: external merge sort. Split the data into M/B segments, recursively sort each segment, and merge the segments using m/B block accesses. (Example: M/B = 3.)
Number of disk I/Os per merge pass: m/B. Depth of recursion: log_{M/B} m. Total sorting cost: (m/B) log_{M/B} m.
17 Once the data is sorted, one sequential scan counts the distinct elements: count = 1; for j = 2, ..., m: if a_j > a_{j-1}, count = count + 1.
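The sort-then-scan approach can be sketched in memory as follows (the run size plays the role of the memory budget M; real external sorting would read and write disk blocks instead):

```python
import heapq

# In-memory illustration of external merge sort followed by a counting scan.
def distinct_by_sorting(stream, run_size=3):
    # Phase 1: split into "memory-sized" runs and sort each run.
    runs = [sorted(stream[i:i + run_size])
            for i in range(0, len(stream), run_size)]
    # Phase 2: k-way merge of the sorted runs (block accesses on disk).
    merged = list(heapq.merge(*runs))
    # Phase 3: one sequential scan; count positions where a_j > a_{j-1}.
    count = 1 if merged else 0
    for prev, cur in zip(merged, merged[1:]):
        if cur > prev:
            count += 1
    return count

print(distinct_by_sorting([4, 2, 4, 8, 2, 1, 1, 9]))  # 5
```

`heapq.merge` consumes the runs lazily, mirroring how an external merge streams one block of each run through memory at a time.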
18 Distinct Elements in Streaming Model
Streaming model: data arrives one item at a time (e.g., from a CD-ROM or a cash register). M units of main memory, M << m. Only one pass over the data; data not stored is lost.
19 Distinct Elements in Streaming Model
Suppose you want to know if the number of distinct elements is at least t.
Initialize a hash function h: {1, ..., n} → {1, ..., t} and initialize the answer to NO.
For each a_i: if h(a_i) == 1, set the answer to YES.
The algorithm uses only 1 bit of storage! (not counting the random bits for h)
20 Distinct Elements in Streaming Model
Suppose you want to know if the number of distinct elements is at least t.
Initialize a hash function h: {1, ..., n} → {1, ..., t}; initialize the answer to NO and count = 0.
For each a_i: if h(a_i) == 1, then count++ (this run returns YES).
Repeat the above procedure for log n different hash functions from the family, and set YES if count > (log n)(1 - 1/e) [boosting the confidence]. This uses log n bits of storage (not counting the random bits for h).
Run log n such algorithms in parallel with t = 2, 4, 8, ..., n: approximate answers with high probability > 1 - 1/n, space usage O(log^2 n).
21 Approximation and randomization are essential!
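The one-bit threshold test with confidence boosting can be sketched as below. The random affine hash family and the trial count are illustrative assumptions standing in for the slide's log n hash functions:

```python
import math, random

# Sketch of the 1-bit test: YES if some h(a_i) == 1, boosted over many hashes.
def at_least_t_distinct(stream, t, trials=16):
    p = (1 << 31) - 1                            # a Mersenne prime
    yes = 0
    for _ in range(trials):
        a = random.randrange(1, p)
        b = random.randrange(p)
        h = lambda x: ((a * x + b) % p) % t + 1  # h: {1,...,n} -> {1,...,t}
        if any(h(x) == 1 for x in stream):       # the one-bit test
            yes += 1
    # Boosting: with >= t distinct items, each trial says YES w.p. >= 1 - 1/e.
    return yes > trials * (1 - 1 / math.e)

random.seed(1)
stream = list(range(1, 101)) * 3        # 100 distinct elements, m = 300
print(at_least_t_distinct(stream, t=8))       # far more than 8 distinct
print(at_least_t_distinct(stream, t=10**6))   # far fewer than 10^6 distinct
```

Running this test in parallel for t = 2, 4, 8, ..., n and taking the largest t that answers YES gives the constant-factor approximation described on the slide.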
22 MapReduce Model
Hardware is relatively cheap, and plenty of parallel algorithms have been designed, but parallel programming is hard: threaded programs are difficult to test and debug, synchronization is tricky, and more machines mean more breakdowns. MapReduce makes parallel programming easy.
23 MapReduce Model
MapReduce makes parallel programming easy: it tracks the jobs and restarts them if needed, and it takes care of data distribution and synchronization. But there is no free lunch: it imposes a structure on the data and only allows for certain kinds of parallelism.
24 MapReduce Model
Data: represented as <Key, Value> pairs.
Map: data → List<Key, Value> [programmer specified]
Shuffle: aggregate all pairs with the same key [handled by the system]
Reduce: <Key, List(Value)> → <Key, List(Value)> [programmer specified]
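The Map → Shuffle → Reduce structure can be simulated in a few lines; this sketch uses the canonical word-count example rather than any particular framework's API:

```python
from itertools import groupby

def map_fn(record):                      # programmer specified
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):              # programmer specified
    return (key, sum(values))

def map_reduce(data):
    pairs = [kv for record in data for kv in map_fn(record)]      # Map
    pairs.sort(key=lambda kv: kv[0])                              # Shuffle:
    grouped = groupby(pairs, key=lambda kv: kv[0])                # group by key
    return [reduce_fn(k, [v for _, v in g]) for k, g in grouped]  # Reduce

print(map_reduce(["big data", "big ideas", "data"]))
# [('big', 2), ('data', 2), ('ideas', 1)]
```

The sort-then-group step is exactly what the system-provided shuffle does at scale, which is why the programmer only ever writes the two pure functions.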
25 Distinct Elements in MapReduce (r servers)
Data: [1,a_1], [2,a_2], ..., [m,a_m]
Map: partition the input across the r reducers: [1,a_1], ..., [1,a_{m/r}], [2,a_{m/r+1}], ..., [2,a_{2m/r}], ..., [r,a_m]. A map step also generates the hash function and makes r copies of it for distribution: [1,h()], [2,h()], ..., [r,h()].
Reduce: reducer j receives its chunk [j,...] together with [j,h()], creates a sketch B_j, and outputs [j,B_j].
Map: [1,B_1], [2,B_2], ..., [r,B_r] → [1,B_1], [1,B_2], ..., [1,B_r], gathering all the sketches at one reducer.
Reduce: reducer 1 receives [1,B_1], ..., [1,B_r], computes B = B_1 + B_2 + ... + B_r, and follows the streaming algorithm to compute the distinct elements from the combined sketch.
26 Crowdsourcing
Incorporating human power for data gathering and computing. People still outperform state-of-the-art algorithms for many data-intensive tasks, typically those involving ambiguity, deep understanding of language or context, or subjective reasoning.
27 Distinct Elements by Crowdsourcing
Ask the crowd, for each pair of elements, whether the two are equal. Create a graph with each element as a node, and add an edge between two nodes whenever the corresponding pair is reported equal. Return the number of connected components.
Also known as record linkage, entity resolution, or deduplication.
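The aggregation step, counting connected components over the crowd's "equal" answers, can be sketched with union-find; the example records are hypothetical:

```python
# Count connected components: each component is one distinct real-world entity.
def count_distinct(elements, equal_pairs):
    parent = {e: e for e in elements}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for a, b in equal_pairs:                # one edge per "equal" crowd answer
        parent[find(a)] = find(b)
    return len({find(e) for e in elements}) # number of connected components

records = ["IBM", "I.B.M.", "International Business Machines", "AT&T"]
pairs = [("IBM", "I.B.M."), ("I.B.M.", "International Business Machines")]
print(count_distinct(records, pairs))  # 2
```

Transitivity comes for free: the crowd never compared "IBM" with the spelled-out name directly, yet the two end up in the same component.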
28 Distinct Elements by Crowdsourcing Distinct elements=4
29 Too many questions for the crowd! Costly. Can we reduce the number of questions?
30 Scalable Algorithm Design
Near-linear time algorithm design: seemingly the best one can do, since merely reading the data takes linear time.
Is linear time good enough? Don't even read the entire data and still return a result! Hope: an exact answer is not required.
31 This is the idea behind property testing.
32 In-the-ballpark vs. out-of-the-ballpark tests
Distinguish inputs that have a specific property from those that are far from having the property.
Benefits: it may be the natural question to ask; it may be just as good when the data is constantly changing; it gives a fast sanity check to rule out very bad inputs (e.g., restaurant bills) or to decide when expensive processing is worth it.
33 Trend change analysis
Compare the transactions of one age group with those of another age group: is there a trend change?
34 Outbreak of diseases Do two diseases follow similar patterns? Are they correlated with income level or zip code? Are they more prevalent near certain areas?
35 Is the lottery uniform?
New Jersey Pick-k Lottery (k = 3, 4): pick k digits in order; 10^k possible values. Are all values equally likely to occur?
36 Global statistical properties
Decisions based on samples of a distribution. Properties: similarities, correlations, information content, distribution of data, ...
37 Another Example: Pattern Matching on Strings
Are two strings similar or not? (number of deletions/insertions to change one into the other)
Text: website content, DNA sequences.
Example: ACTGCTGTACTGACT (length 15) vs. CATCTGTATTGAT (length 13), match size = 11.
38 Pattern Matching on Strings
Previous algorithms using classical techniques compute the edit distance of strings of size n in at least n^2 time; for strings of size 1000 that is 1,000,000 operations.
39 Can we compute edit distance in near-linear time? Various methods; the key is metric embedding. Can we test whether two strings have edit distance below c (small) or above c (big) in sub-linear time? Property testing.
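The quadratic baseline the slide refers to is the classical dynamic program. This sketch uses the insertion/deletion-only variant matching the similarity notion above:

```python
# Classical O(n*m) dynamic program for edit distance with insertions and
# deletions only (equivalently, n + m - 2*LCS).
def edit_distance(s, t):
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # delete all of s[:i]
    for j in range(m + 1):
        D[0][j] = j                      # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                D[i][j] = D[i - 1][j - 1]               # characters match
            else:
                D[i][j] = min(D[i - 1][j] + 1,          # delete s[i-1]
                              D[i][j - 1] + 1)          # insert t[j-1]
    return D[n][m]

print(edit_distance("ACTGCTGTACTGACT", "CATCTGTATTGAT"))
```

Every cell depends only on its three neighbors, so the n^2 table is forced by this formulation; beating it is what the embedding and property-testing results are about.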
40 Metric Embedding
A metric space is a pair (X, d) where X is a set and d: X × X → [0, ∞) is a metric satisfying
i) d(x,y) = 0 if and only if x = y
ii) d(x,y) = d(y,x)
iii) d(x,y) + d(y,z) >= d(x,z)
41 Isometric embedding: f: X → R^2 with d(x_a, x_b) = d(f(x_a), f(x_b)). Example: embedding a dissimilarity matrix for bacterial strains in microbiology.
Benefits: 1) succinct representation, 2) easy-to-understand structure, 3) efficient algorithms.
42 Metric Embedding with Distortion (C-embedding)
f: X → X', where (X, d) and (X', d') are metric spaces, such that for some r ∈ (0, ∞): r·d(x_a, x_b) <= d'(f(x_a), f(x_b)) <= C·r·d(x_a, x_b). Distortion = C.
Examples: any general n-point metric embeds into a tree metric with C = O(log n); specific metrics embed into normed spaces, e.g., edit distance (Levenshtein distance) into low-dimensional l_1; graphs: a t-spanner gives C = t; high-dimensional spaces embed into low-dimensional spaces, e.g., the Johnson-Lindenstrauss theorem: flattening in l_2 with C = 1 + epsilon [dimensionality reduction].
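The Johnson-Lindenstrauss flattening can be sketched with a random Gaussian projection; the point set, target dimension k, and seeds below are illustrative (the theorem takes k = O(log n / eps^2)):

```python
import math, random

# Project d-dimensional points onto k random Gaussian directions, scaled by
# 1/sqrt(k); pairwise l2 distances are preserved up to (1 +/- eps) w.h.p.
def jl_project(points, k, seed=0):
    rng = random.Random(seed)
    d = len(points[0])
    R = [[rng.gauss(0, 1) / math.sqrt(k) for _ in range(d)]
         for _ in range(k)]                      # k x d projection matrix
    return [[sum(row[i] * p[i] for i in range(d)) for row in R]
            for p in points]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(42)
points = [[rng.uniform(-1, 1) for _ in range(1000)] for _ in range(5)]
low = jl_project(points, k=200)                  # 1000 dims -> 200 dims
ratio = dist(low[0], low[1]) / dist(points[0], points[1])
print(round(ratio, 2))   # typically within a few percent of 1
```

Notably, the projection is oblivious: R is drawn without looking at the data, which is what makes the same idea usable in streaming and MapReduce settings.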
43 Sparse Transformation
Sparse Fourier Transform; Sparse Johnson-Lindenstrauss (fast dimensionality reduction).
44 Fourier Transform
Discrete Fourier Transform: given a signal x[1..n], compute the frequency vector x̂, where x̂_f = Σ_t x_t e^{-2πi tf/n}.
A very useful tool: compression (audio, image, video), data analysis, feature extraction. (Figure: sampled audio data in time, and the DFT of the audio samples in frequency.)
See the SIGMOD'04 tutorial "Indexing and Mining Streams" by C. Faloutsos.
45 Computing DFT
The Fast Fourier Transform (FFT) computes all the frequencies in time O(n log n). But we can do (much) better if we only care about a small number k of dominant frequencies, i.e., if the signal is (approximately) k-sparse (only k non-zero entries):
Exactly k-sparse signals: O(k log n).
Approximately k-sparse signals: O(k log n · log(n/k)).
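To illustrate what "recovering a k-sparse spectrum" means, here is a naive O(n^2) DFT that picks out the k dominant frequencies. This is only the brute-force baseline, not the sparse FFT algorithm itself:

```python
import cmath, math

# Naive O(n^2) DFT: x_hat[f] = sum_t x[t] * e^{-2*pi*i*t*f/n}.
def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * f / n)
                for t in range(n))
            for f in range(n)]

def top_k_frequencies(x, k):
    spectrum = dft(x)
    return sorted(range(len(x)), key=lambda f: -abs(spectrum[f]))[:k]

n = 64
# Signal with two dominant frequencies, 3 and 10 (a sparse spectrum).
signal = [math.cos(2 * math.pi * 3 * t / n)
          + 0.5 * math.cos(2 * math.pi * 10 * t / n)
          for t in range(n)]
print(sorted(top_k_frequencies(signal, k=4)))  # [3, 10, 54, 61]
```

The mirror frequencies 54 = n - 10 and 61 = n - 3 appear because the input is real-valued; a sparse FFT would find the same support in roughly O(k log n) time instead of O(n^2).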
46 Agenda Introduction to Data Streaming Model Finding Frequent Items Deterministically Lower bound for deterministic computation for distinct items Interlude to concentration inequality Analysis of counting distinct items algorithms
Analysis of Algorithms, I
Analysis of Algorithms, I CSOR W4231.002 Eleni Drinea Computer Science Department Columbia University Thursday, February 26, 2015 Outline 1 Recap 2 Representing graphs 3 Breadth-first search (BFS) 4 Applications
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,
Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
Lecture 10 - Functional programming: Hadoop and MapReduce
Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional
University of Dayton Department of Computer Science Undergraduate Programs Assessment Plan DRAFT September 14, 2011
University of Dayton Department of Computer Science Undergraduate Programs Assessment Plan DRAFT September 14, 2011 Department Mission The Department of Computer Science in the College of Arts and Sciences
Virtual Landmarks for the Internet
Virtual Landmarks for the Internet Liying Tang Mark Crovella Boston University Computer Science Internet Distance Matters! Useful for configuring Content delivery networks Peer to peer applications Multiuser
http://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS
Volume 2, No. 3, March 2011 Journal of Global Research in Computer Science RESEARCH PAPER Available Online at www.jgrcs.info IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE
The Big Data Paradigm Shift. Insight Through Automation
The Big Data Paradigm Shift Insight Through Automation Agenda The Problem Emcien s Solution: Algorithms solve data related business problems How Does the Technology Work? Case Studies 2013 Emcien, Inc.
W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
A bachelor of science degree in electrical engineering with a cumulative undergraduate GPA of at least 3.0 on a 4.0 scale
What is the University of Florida EDGE Program? EDGE enables engineering professional, military members, and students worldwide to participate in courses, certificates, and degree programs from the UF
Understanding the Benefits of IBM SPSS Statistics Server
IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster
CLUSTERING FOR FORENSIC ANALYSIS
IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 4, Apr 2014, 129-136 Impact Journals CLUSTERING FOR FORENSIC ANALYSIS
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
