Lecture #2. Algorithms for Big Data
|
|
|
- Owen Horn
- 10 years ago
- Views:
Transcription
1 Additional Topics: Big Data Lecture #2 Algorithms for Big Data Joseph Bonneau April 30, 2012
2 Today's topic: algorithms
3 Do we need new algorithms? Quantity is a quality of its own Joseph Stalin, apocryphal can't always store all data memory vs. disk becomes critical algorithms with limited passes N2 is impossible online/streaming algorithms approximate algorithms human insight is limited algorithms for high dimensional data
4 Simple algorithms, more data Mining of Massive Datasets Anand Rajaraman, Jeffrey Ullman 2010 Plus Stanford course, pieces adapted here Synopsis Data Structures for Massive Data Sets Phillip Gibbons, Yossi Mattias, 1998 The Unreasonable Effectiveness of Data Alon Halevy, Peter Norvig, Fernando Perreira, 2010
5 The easy cases sorting Google 1 trillion items, (1 PB) sorted in 6 hours searching hashing & distributed search
6 Streaming algorithms Have we seen x before?
7 Bloom filter (Bloom 1970) assume we can store n bits b1, b2, bn use a hash function H(x) = h1, h2, hk each hi [1, n] when we see x, set bi, for each i in H(x) n = 8, k = 4 B =
8 Bloom filter (Bloom 1970) assume we can store n bits b1, b2, bn use a hash function H(x) = h1, h2, hk each hi [1, n] when we see x, set bi, for each i in H(x) n = 8, k = 4 H(jcb82) = 4, 7, 5, 2 B =
9 Bloom filter (Bloom 1970) assume we can store n bits b1, b2, bn use a hash function H(x) = h1, h2, hk each hi [1, n] when we see x, set bi, for each i in H(x) n = 8, k = 4 H(jcb82) = H(rja14) = 4, 7, 5, 2 2, 4, 3, 8 B =
10 Bloom filter (Bloom 1970) assume we can store n bits b1, b2, bn use a hash function H(x) = h1, h2, hk each hi [1, n] when we see x, set bi, for each i in H(x) n = 8, k = 4 H(jcb82) = H(rja14) = H(mgk25) = 4, 7, 5, 2 2, 4, 3, 8 8, 3, 7, 1 B =
11 Bloom filter (Bloom 1970) assume we can store n bits b1, b2, bn use a hash function H(x) = h1, h2, hk each hi [1, n] when we see x, set bi, for each i in H(x) n = 8, k = 4 H(jcb82) = H(rja14) = H(mgk25) = H(fms27) = 4, 7, 5, 2 2, 4, 3, 8 8, 3, 7, 1 1, 5, 2, 4 B =
12 Bloom filter (Bloom 1970) assume we can store n bits b1, b2, bn use a hash function H(x) = h1, h2, hk each hi [1, n] when we see x, set bi, for each i in H(x) n = 8, k = 4 H(jcb82) = H(rja14) = H(mgk25) = H(fms27) = 4, 7, 5, 2 2, 4, 3, 8 8, 3, 7, 1 1, 5, 2, 4 B = false positive
13 Bloom filters constant time lookup/insertion no false negatives false positives from N items: ln p = ( N/n) / (ln 2)2 at 1% p, 1 GB of RAM 77 billion unique items! once full, can't expand counting bloom filter: store integers instead of bits
14 Application: market baskets assume TESCO stocks 100k items market basket: {ravioli, pesto sauce, olive oil, wine} perhaps 1 billion baskets per year Amazon stocks many more 2100,000 possible baskets... what items are frequently purchased together? more frequent than predicted by chance
15 Application: market baskets pass #1: add all sub baskets to counting BF with 8 GB of RAM: 25B baskets 100,000 ( restrict basket size > 100 trillion 3 ) pass #2: check all sub baskets check {x}, {y1,, yn} {x, y1, yn} store possible rules {y1,, yn} {x} pass #3: eliminate false positives for better approaches see Ramarajan/Ullman
16 Other streaming challenges rolling average of previous k items hot list most frequent items seen so far sliding window of traffic volume probabilistically start tracking new items histogram of values
17 Querying data streams Select segno, dir, hwy From SegSpeedStr [Range 5 Minutes] Group By segno, dir, hwy Having Avg(speed) < 40 Continuous Query Language (CQL) Oracle now shipping version 1 based on SQL...
18 Similarity metrics Which items are similar to x?
19 Similarity metrics Which items are similar to x? reduce x to a high dimensional vector features: words in a document (bag of words) shingles (sequences of w characters) various body part measurements (faces) edges, colours in a photograph
20 Feature extraction - images Gist Siagan, Itti, 2007
21 Distance metrics Chebyshev distance max i ( x i y i ) Manhattan distance x i y i Hamming distance for binary vectors Euclidean distance (x y ) Mahalanobis distance Adjusts for different variance Cosine distance Jaccard distance (binary) 2 i i ( x i y i )2 σ2 cos θ= x i y i X Y X Y X Y X Y
22 The curse of dimensionality d = 2: area( ) = π, area( ) = 4 d = 3: area( ) = (4/3) π, area( ) = 8 ratio = 0.52 d : ratio = 0.78 ratio 0! all points are far by Euclidean distance
23 The curse of dimensionality space is typically very sparse most dimensions are semantically useless hard for humans to tell which ones need dimension reduction
24 Singular value decomposition D [ m n]=u [ m n ] Σ [n n] V books to genres books to words (input data) T [n n] genres to words genre strength (diagonal) see Baker '05
25 Singular value decomposition [ [ books to words (input data) ] ] [ books to words (approximate) = books to genres ] [ ][ genre strength (diagonal) genres to words ] see Baker '05
26 Singular value decomposition [ [ books to words (input data) ] ] [ books to words (approximate) books to genres = books to genres [ ][ genre strength (diagonal) ] genres to words ] see Baker '05
27 Singular value decomposition [ [ books to words (input data) ] ] [ books to words (approximate) = genres to words books to genres ] [ ][ genre strength (diagonal) genres to words ] see Baker '05
28 Singular value decomposition 2 computation takes O(mn ), with m > n useful but out of reach for largest datasets implemented in most statistics packages R, MATLAB, etc (often) better linear algebra approaches exist CUR, CMD decomposition
29 Locality-sensitive hashing replace SVD with simple probabilistic approach choose a family of hash functions H such that: distance(x, Y) distance(h(x), H(Y)) H can produce any number of bits compute several different H investigate further if H(X) = H(Y) exactly scale output size of H to minimise cost
30 MinHash implementation compute a random permutation σr(x) count the number of 0's before a 1 appears theorem: pr[mh(x) = MH(Y)] = 1 Jaccard(X, Y) combine multiple permutations to add bits
31 LSH example Flickr photos Kulis and Grauman, 2010
32 Recommendation systems Can we recommend books, films, products to users based on their personal tastes?
33 Recommendation systems Anderson 2004
34 Content-based filtering extract features from content actors director genre keywords in plot summary etc. find nearest neighbours to what a user likes or buys
35 Content-based filtering PROS rate new items no herding no user information exposed to others CONS features may not be relevant recommendations may be boring filter bubble
36 Collaborative filtering features are user/item interactions purchases explicit ratings need lots of clean up, scaling user user filtering: find similar users suggest their top ratings scale for each user's rating style item item filtering: find similar items suggest most similar items
37 Item-item filtering preferred Koren, Bell, Volinksy, 2009
38 Collaborative filtering PROS automatic feature extraction surprising recommendations CONS ratings for new users/items herding privacy
39 The Netflix Prize, M film ratings made available 480k users 17k films (shoddily) anonymised 3M ratings held in reserve goal: improve predictions by 10% measure by RMS error US $1 M prize attracted over 2500 teams
40 Netflix Prize insights need to heavily normalise user ratings some users more critical than others user rating style temporal bias latent item factors is strongest approach
41 Netflix Prize league table
42 Netflix Prize aftermath winning solution is a messy hybrid never implemented in practice! too much pain for little gain...
43 Clustering Can we identify groups of similar items?
44 Clustering Can we identify groups of similar items?
45 Clustering identify categories of: organisms (species) customers (tastes) graph nodes (community detection) develop cluster scores based on item similarity diameter of clusters? radius of clusters? average distance? distance from other clusters?
46 Hierarchical clustering 2 O(N lg N) Lindblad-Toh et al. 2005
47 Approximate k-means choose # of categories k in advance can test various values draw a random RAM full of items cluster them optimally into k clusters identify the centre centroid: average of points (Euclidean) clustroid: closest to other points assign remaining points to closest centre O(N) time
48 Classification? Given some examples, can we classify new items?
49 Classification is this item... a spam ? a cancerous cell? the letter 'J'? many approaches exist: neural networks Bayesian decision trees domain specific probabilistic models
50 Manual classification Nowak et al. 2011
51 Manual classification ESP Game, Von Ahn et al.
52 k-nearest neighbour classifier? Classify a point by majority vote of its k nearest neighbors
53 k-nearest neighbour classifier? k can make a big difference!
54 Bias-variance tradeoff too much variance
55 Bias-variance tradeoff too much bias
56 Linear classifiers? In high dimensional space, linear is often best (If not? Map to a different space and linearize kernel machines)
57 Support vector machines (SVM)? find hyperplane with maximal distance between any points closest points define the plane (support vectors)
58 SVM example cat or dog? Golle 2007
59 Machine learning for dummies many other algorithms for classifying/clustering learn the concepts and what you can tweak let others do the hard work libraries: Shogun, libsvm, scikit learn Apache Mahout: works with Hadoop! outsource to Kaggle... more in the next lecture
60 Thank you
Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015
Clustering Big Data Efficient Data Mining Technologies J Singh and Teresa Brooks June 4, 2015 Hello Bulgaria (http://hello.bg/) A website with thousands of pages... Some pages identical to other pages
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
Collaborative Filtering. Radek Pelánek
Collaborative Filtering Radek Pelánek 2015 Collaborative Filtering assumption: users with similar taste in past will have similar taste in future requires only matrix of ratings applicable in many domains
B490 Mining the Big Data. 0 Introduction
B490 Mining the Big Data 0 Introduction Qin Zhang 1-1 Data Mining What is Data Mining? A definition : Discovery of useful, possibly unexpected, patterns in data. 2-1 Data Mining What is Data Mining? A
Machine learning for algo trading
Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
Predicting Flight Delays
Predicting Flight Delays Dieterich Lawson [email protected] William Castillo [email protected] Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing
Big Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
Analytics on Big Data
Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis
Mammoth Scale Machine Learning!
Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes
Classification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, [email protected]) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE
BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE Alex Lin Senior Architect Intelligent Mining [email protected] Outline Predictive modeling methodology k-nearest Neighbor
SoSe 2014: M-TANI: Big Data Analytics
SoSe 2014: M-TANI: Big Data Analytics Lecture 4 21/05/2014 Sead Izberovic Dr. Nikolaos Korfiatis Agenda Recap from the previous session Clustering Introduction Distance mesures Hierarchical Clustering
Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum
Statistical Validation and Data Analytics in ediscovery Jesse Kornblum Administrivia Silence your mobile Interactive talk Please ask questions 2 Outline Introduction Big Questions What Makes Things Similar?
MS1b Statistical Data Mining
MS1b Statistical Data Mining Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Administrivia and Introduction Course Structure Syllabus Introduction to
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
Machine Learning Final Project Spam Email Filtering
Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
Data Mining Analytics for Business Intelligence and Decision Support
Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing
Map-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,
Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model
AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different
Advanced Ensemble Strategies for Polynomial Models
Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer
MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.
Multimedia Databases Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Previous Lecture 13 Indexes for Multimedia Data 13.1
Practical Introduction to Machine Learning and Optimization. Alessio Signorini <[email protected]>
Practical Introduction to Machine Learning and Optimization Alessio Signorini Everyday's Optimizations Although you may not know, everybody uses daily some sort of optimization
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Fast Matching of Binary Features
Fast Matching of Binary Features Marius Muja and David G. Lowe Laboratory for Computational Intelligence University of British Columbia, Vancouver, Canada {mariusm,lowe}@cs.ubc.ca Abstract There has been
Journée Thématique Big Data 13/03/2015
Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets
Neural Networks Lesson 5 - Cluster Analysis
Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm [email protected] Rome, 29
Clustering. Data Mining. Abraham Otero. Data Mining. Agenda
Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
Combining SVM classifiers for email anti-spam filtering
Combining SVM classifiers for email anti-spam filtering Ángela Blanco Manuel Martín-Merino Abstract Spam, also known as Unsolicited Commercial Email (UCE) is becoming a nightmare for Internet users and
Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)
Algorithmic Aspects of Big Data Nikhil Bansal (TU Eindhoven) Algorithm design Algorithm: Set of steps to solve a problem (by a computer) Studied since 1950 s. Given a problem: Find (i) best solution (ii)
Machine Learning in Spam Filtering
Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov [email protected] Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.
MACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
Teaching Scheme Credits Assigned Course Code Course Hrs./Week. BEITC802 Big Data 04 02 --- 04 01 --- 05 Analytics. Theory Marks
Teaching Scheme Credits Assigned Course Code Course Hrs./Week Name Theory Practical Tutorial Theory Practical/Oral Tutorial Tota l BEITC802 Big Data 04 02 --- 04 01 --- 05 Analytics Examination Scheme
Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will
RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING
= + RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING Stefan Savev Berlin Buzzwords June 2015 KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data sparsity:
APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder
APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
The Delicate Art of Flower Classification
The Delicate Art of Flower Classification Paul Vicol Simon Fraser University University Burnaby, BC [email protected] Note: The following is my contribution to a group project for a graduate machine learning
A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader
A Performance Evaluation of Open Source Graph Databases Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader Overview Motivation Options Evaluation Results Lessons Learned Moving Forward
Medical Information Management & Mining. You Chen Jan,15, 2013 [email protected]
Medical Information Management & Mining You Chen Jan,15, 2013 [email protected] 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
Big Data & Scripting Part II Streaming Algorithms
Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set
Clustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances
240 Chapter 7 Clustering Clustering is the process of examining a collection of points, and grouping the points into clusters according to some distance measure. The goal is that points in the same cluster
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
Machine Learning Capacity and Performance Analysis and R
Machine Learning and R May 3, 11 30 25 15 10 5 25 15 10 5 30 25 15 10 5 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 100 80 60 40 100 80 60 40 100 80 60 40 30 25 15 10 5 25 15 10
Data Mining and Visualization
Data Mining and Visualization Jeremy Walton NAG Ltd, Oxford Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research
2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)
2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
Advanced In-Database Analytics
Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??
Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme
Big Data Analytics Prof. Dr. Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie,
Big Data Analytics and Optimization
Big Data Analytics and Optimization C e r t i f i c a t e P r o g r a m i n E n g i n e e r i n g E x c e l l e n c e e.edu.in http://www.insof LIST OF COURSES Essential Business Skills for a Data Scientist...
Unsupervised Data Mining (Clustering)
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
Big Data & Scripting Part II Streaming Algorithms
Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health
Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining
KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics
ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of
International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015
RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
Outline. What is Big data and where they come from? How we deal with Big data?
What is Big Data Outline What is Big data and where they come from? How we deal with Big data? Big Data Everywhere! As a human, we generate a lot of data during our everyday activity. When you buy something,
Is a Data Scientist the New Quant? Stuart Kozola MathWorks
Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by
Machine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Content-Based Recommendation
Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches
HPC ABDS: The Case for an Integrating Apache Big Data Stack
HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox [email protected] http://www.infomall.org
MACHINE LEARNING BASICS WITH R
MACHINE LEARNING [Hands-on Introduction of Supervised Machine Learning Methods] DURATION 2 DAY The field of machine learning is concerned with the question of how to construct computer programs that automatically
CSE 427 CLOUD COMPUTING WITH BIG DATA APPLICATIONS
CSE 427 CLOUD COMPUTING WITH BIG DATA APPLICATIONS COURSE OVERVIEW & STRUCTURE Fall 2015 Marion Neumann ABOUT Marion Neumann email: m dot neumann at wustl dot edu office: Jolley Hall 403 office hours:
Software Engineering for Big Data. CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo
Software Engineering for Big Data CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo Big Data Big data technologies describe a new generation of technologies that aim
CPSC 340: Machine Learning and Data Mining. Mark Schmidt University of British Columbia Fall 2015
CPSC 340: Machine Learning and Data Mining Mark Schmidt University of British Columbia Fall 2015 Outline 1) Intro to Machine Learning and Data Mining: Big data phenomenon and types of data. Definitions
Introduction to Data Science: CptS 483-06 Syllabus First Offering: Fall 2015
Course Information Introduction to Data Science: CptS 483-06 Syllabus First Offering: Fall 2015 Credit Hours: 3 Semester: Fall 2015 Meeting times and location: MWF, 12:10 13:00, Sloan 163 Course website:
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
Machine Learning for Data Science (CS4786) Lecture 1
Machine Learning for Data Science (CS4786) Lecture 1 Tu-Th 10:10 to 11:25 AM Hollister B14 Instructors : Lillian Lee and Karthik Sridharan ROUGH DETAILS ABOUT THE COURSE Diagnostic assignment 0 is out:
Bayesian networks - Time-series models - Apache Spark & Scala
Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
Bringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the
Machine Learning CS 6830. Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science [email protected]
Machine Learning CS 6830 Razvan C. Bunescu School of Electrical Engineering and Computer Science [email protected] What is Learning? Merriam-Webster: learn = to acquire knowledge, understanding, or skill
WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat
Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise
203.4770: Introduction to Machine Learning Dr. Rita Osadchy
203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:
What is Data Science? Data, Databases, and the Extraction of Knowledge Renée T., @becomingdatasci, November 2014
What is Data Science? { Data, Databases, and the Extraction of Knowledge Renée T., @becomingdatasci, November 2014 Let s start with: What is Data? http://upload.wikimedia.org/wikipedia/commons/f/f0/darpa
The Need for Training in Big Data: Experiences and Case Studies
The Need for Training in Big Data: Experiences and Case Studies Guy Lebanon Amazon Background and Disclaimer All opinions are mine; other perspectives are legitimate. Based on my experience as a professor
CS 207 - Data Science and Visualization Spring 2016
CS 207 - Data Science and Visualization Spring 2016 Professor: Sorelle Friedler [email protected] An introduction to techniques for the automated and human-assisted analysis of data sets. These
Pixels Description of scene contents. Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT) Banksy, 2006
Object Recognition Large Image Databases and Small Codes for Object Recognition Pixels Description of scene contents Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT)
Chapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained
Chapter 20: Data Analysis
Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification
Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28
Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object
