Scaling Up Machine Learning
1 Scaling Up Machine Learning Parallel and Distributed Approaches Ron Bekkerman, LinkedIn Misha Bilenko, MSR John Langford, Y!R
2 Outline Introduction (Ron) Tree Induction (Misha) Break Graphical Models (Misha) Learning on GPUs (John) Linear Learning (John) Conclusion (John)
3 The book Cambridge University Press, due in November Chapters covering: platforms, algorithms, learning setups, applications
4 Chapter contributors
5 Previous books
6 Data hypergrowth: an example Reuters-21578: about 10K docs (ModApte) Bekkerman et al, SIGIR 2001 RCV1: about 807K docs Bekkerman & Scholz, CIKM 2008 LinkedIn job title data: about 100M docs Bekkerman & Gavish, KDD
7 New age of big data The world has gone mobile 5 billion cellphones produce daily data Social networks have gone online Twitter produces 200M tweets a day Crowdsourcing is the reality Labeling of 100,000+ data instances is doable Within a week
8 Size matters One thousand data instances One million data instances One billion data instances One trillion data instances Those are not different numbers, those are different mindsets
9 One thousand data instances Will process it manually within a day (a week?) No need for an automatic approach We shouldn't publish main results on datasets of such size
10 One million data instances Currently, the most active zone Can be crowdsourced Can be processed by a quadratic algorithm Once parallelized A 1M data collection cannot be too diverse But can be too homogeneous Preprocessing / data probing is crucial
11 A big dataset cannot be too sparse 1M data instances cannot belong to 1M classes Simply because it's not practical to have 1M classes Here's a statistical experiment, in the text domain: 1M documents Each document is 100 words long Randomly sampled from a unigram language model No stopwords 245M pairs have word overlap of 10% or more Real-world datasets are denser than random
12 Can big datasets be too dense? (figure: a grid of near-identical jpg images)
13 Real-world example
14 (Near) duplicate detection Bekkerman et al, KDD 2009
15 One billion data instances Web-scale Guaranteed to contain data in different formats ASCII text, pictures, javascript code, PDF documents Guaranteed to contain (near) duplicates Likely to be badly preprocessed Storage is an issue
16 One trillion data instances Beyond the reach of modern technology The peer-to-peer paradigm is (arguably) the only way to process the data Data privacy / inconsistency / skewness issues Can't be kept in one location Is intrinsically hard to sample
17 A solution to the data privacy problem n machines with n private datasets (D1, ..., D6 in the figure) All datasets intersect The intersection is shared Each machine learns a separate model Models get consistent over the data intersection Xiang et al, Chapter 16 Check out Chapter 16 to see this approach applied in a recommender system!
18 So what model will we learn? Supervised model? Unsupervised model? Semi-supervised model? Obviously, depending on the application But also on availability of labeled data And its trustworthiness!
19 Size of training data Say you have 1K labeled and 1M unlabeled examples Labeled/unlabeled ratio: 0.1% Is 1K enough to train a supervised model? Now you have 1M labeled and 1B unlabeled examples Labeled/unlabeled ratio: 0.1% Is 1M enough to train a supervised model?
20 Skewness of training data Usually, training data comes from users Explicit user feedback might be misleading Feedback providers may have various incentives Learning from implicit feedback is a better idea E.g. clicking on Web search results In large-scale setups, skewness of training data is hard to detect
21 Real-world example Goal: find high-quality professionals on LinkedIn Idea: use recommendation data to train a model Whoever has recommendations is a positive example Is it a good idea?
22 Not enough (clean) training data? Use existing labels as guidance rather than a directive In a semi-supervised clustering framework Or label more data! With a little help from the crowd
23 Semi-supervised clustering Cluster unlabeled data D while taking labeled data D* into account Construct a clustering D̃ of D that maximizes Mutual Information I(D̃; D̃*) While keeping the number of clusters k constant D̃* is defined naturally over classes in D* Bekkerman et al, ECML 2006 Results better than those of classification
24 Semi-supervised clustering (details) Define an empirical joint distribution P(D, D*), where P(d, d*) is a normalized similarity between d and d* Define the joint between clusterings P(d̃, d̃*) = Σ_{d ∈ d̃} Σ_{d* ∈ d̃*} P(d, d*), with marginals P(D̃) and P(D̃*) Then I(D̃; D̃*) = Σ_{d̃} Σ_{d̃*} P(d̃, d̃*) log [ P(d̃, d̃*) / (P(d̃) P(d̃*)) ]
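The quantities above are straightforward to compute directly. A minimal sketch (the toy joint distribution and the cluster maps below are invented for illustration):

```python
from math import log2

def mutual_information(P):
    """I(X;Y) for a joint distribution given as {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in P.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in P.items() if p > 0)

def clustered_joint(P, xmap, ymap):
    """Aggregate P(d, d*) into P(d~, d~*) by summing within cluster pairs."""
    Q = {}
    for (x, y), p in P.items():
        key = (xmap[x], ymap[y])
        Q[key] = Q.get(key, 0.0) + p
    return Q

# Hypothetical joint over 4 documents x 2 labeled classes
P = {("d1", "A"): 0.3, ("d2", "A"): 0.2, ("d3", "B"): 0.3, ("d4", "B"): 0.2}
good = clustered_joint(P, {"d1": 0, "d2": 0, "d3": 1, "d4": 1}, {"A": "A", "B": "B"})
bad = clustered_joint(P, {"d1": 0, "d2": 1, "d3": 0, "d4": 1}, {"A": "A", "B": "B"})
print(mutual_information(good))   # 1.0 bit: clustering aligned with the classes
print(mutual_information(bad))    # 0.0 bits: clustering uninformative about classes
```

A clustering that respects the labeled classes scores a higher I(D̃; D̃*), which is exactly what the optimization exploits.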
25 Crowdsourcing labeled data Crowdsourcing is a tough business People are not machines Any worker who can game the system will game the system Validation framework + qualification tests are a must Labeling a lot of data can be fairly expensive
26 How to label 1M instances Budget a month of work + about $50,000
27 How to label 1M instances Hire a data annotation contractor in your town Presumably someone you know well enough
28 How to label 1M instances Offer the guy $10,000 for one month of work
29 How to label 1M instances Construct a qualification test for your job Hire 100 workers who pass it Keep their worker IDs
30 How to label 1M instances Explain to them the task, make sure they get it
31 How to label 1M instances Offer them 4¢ per data instance if they do it right
32 How to label 1M instances Your contractor will label 500 data instances a day This data will be used to validate worker results
33 How to label 1M instances You'll need to spot-check the results
35 How to label 1M instances Each worker gets a daily task of 1000 data instances
36 How to label 1M instances Some of which are already labeled by the contractor
37 How to label 1M instances Check every worker's result on that validation set
39 How to label 1M instances Fire the worst 50 workers Disregard their results
40 How to label 1M instances Hire 50 new ones
41 How to label 1M instances Repeat for a month (20 working days): 50 workers × 20 days × 1000 data points a day × 4¢
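Sanity-checking the arithmetic of the plan above (reading the per-instance rate as 4 cents, which is what makes the numbers meet the ~$50,000 budget from the earlier slide):

```python
# Back-of-the-envelope check of the crowdsourcing plan
workers, days, per_day, cents = 50, 20, 1000, 4
labeled = workers * days * per_day               # instances labeled in a month
crowd_cost = labeled * cents / 100               # dollars paid to the crowd
contractor = 10_000                              # one month of contractor work
print(labeled)                  # 1000000 instances
print(crowd_cost + contractor)  # 50000.0 dollars -> the ~$50K budget
```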
42 Got 1M labeled instances, now what? Now go train your model Rule of thumb: heavier algorithms produce better results Rule of the other thumb: forget about super-quadratic algorithms Parallelization looks unavoidable
43 Parallelization: platform choices

   Platform           Communication Scheme    Data size
   Peer-to-Peer       TCP/IP                  Petabytes
   Virtual Clusters   MapReduce / MPI         Terabytes
   HPC Clusters       MPI / MapReduce         Terabytes
   Multicore          Multithreading          Gigabytes
   GPU                CUDA                    Gigabytes
   FPGA               HDL                     Gigabytes
44 Example: k-means clustering An EM-like algorithm: Initialize k cluster centroids E-step: associate each data instance with the closest centroid Find expected values of cluster assignments given the data and centroids M-step: recalculate centroids as an average of the associated data instances Find new centroids that maximize that expectation
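One E/M iteration of the algorithm above can be sketched as follows (plain serial Python, illustrative only; all names are ours):

```python
def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    d2 = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return d2.index(min(d2))

def kmeans_step(data, centroids):
    """One E/M iteration of k-means: assign instances, then recompute centroids."""
    # E-step: associate each data instance with the closest centroid
    assign = [closest(x, centroids) for x in data]
    # M-step: recalculate each centroid as the average of its associated instances
    new_centroids = []
    for k in range(len(centroids)):
        members = [x for x, a in zip(data, assign) if a == k]
        if members:
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centroids.append(list(centroids[k]))  # keep an empty cluster's centroid
    return assign, new_centroids

data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assign, cents = kmeans_step(data, [(0.0, 0.0), (10.0, 10.0)])
print(assign)   # [0, 0, 1, 1]
print(cents)    # [[0.0, 0.5], [10.0, 10.5]]
```

The parallelization schemes on the following slides all split exactly these two steps across machines.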
45 Parallelizing k-means
48 Parallelization: platform choices (platform table repeated)
49 Peer-to-peer (P2P) systems Millions of machines connected in a network Each machine can only contact its neighbors Each machine stores millions of data instances Practically unlimited scale Communication is the bottleneck Aggregation is costly, broadcast is cheaper Messages are sent over a spanning tree With an arbitrary node being the root
50 k-means in P2P Uniformly sample k centroids over P2P Using a random walk method Datta et al, TKDE 2009 Broadcast the centroids Run local k-means on each machine Sample n nodes Aggregate local centroids of those n nodes
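The P2P round above (broadcast centroids, run k-means locally, aggregate a sample of nodes) can be simulated on one machine. A hedged sketch: the random-walk sampling and spanning-tree broadcast are abstracted into seeded random node sampling and plain function calls:

```python
import random

def local_stats(shard, centroids):
    """One peer's local E-step: assign its instances, report per-cluster (sum, count)."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for x in shard:
        j = min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
        counts[j] += 1
        sums[j] = [s + a for s, a in zip(sums[j], x)]
    return sums, counts

def p2p_round(shards, centroids, n_sample, rng):
    """Broadcast centroids, run local k-means on every peer, then aggregate
    the local centroids of a random sample of n_sample peers."""
    agg = [[0.0] * len(centroids[0]) for _ in centroids]
    sizes = [0] * len(centroids)
    for i in rng.sample(range(len(shards)), n_sample):
        sums, counts = local_stats(shards[i], centroids)
        for j in range(len(centroids)):
            sizes[j] += counts[j]
            agg[j] = [a + s for a, s in zip(agg[j], sums[j])]
    return [[a / sizes[j] for a in agg[j]] if sizes[j] else list(centroids[j])
            for j in range(len(centroids))]

rng = random.Random(0)
shards = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
new_cents = p2p_round(shards, [(0.0, 0.0), (10.0, 10.0)], n_sample=2, rng=rng)
print(new_cents)   # [[0.0, 1.0], [10.0, 11.0]]
```

Sampling only n of the nodes trades centroid accuracy for communication cost, which is the bottleneck in P2P.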
51 Parallelization: platform choices (platform table repeated)
52 Virtual clusters Datacenter-scale clusters Hundreds of thousands of machines Distributed file system Data redundancy Cloud computing paradigm Virtualization, full fault tolerance, pay-as-you-go MapReduce is #1 data processing scheme
53 MapReduce Mappers process the data in parallel; their output is shuffled to reducers, which also process in parallel Mappers output (key, value) records Records with the same key are sent to the same reducer
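The contract above can be mimicked in a few lines. A toy in-memory version (our own names, no actual parallelism) that shows only the (key, value) interface and the shuffle:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy in-memory MapReduce: map (here, a loop), shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):     # mappers emit (key, value) records
            shuffled[key].append(value)    # shuffle: same key -> same reducer
    return {key: reducer(key, values) for key, values in shuffled.items()}

# Word count, the canonical example
docs = ["new york", "new data"]
counts = map_reduce(docs,
                    mapper=lambda doc: [(w, 1) for w in doc.split()],
                    reducer=lambda w, ones: sum(ones))
print(counts)   # {'new': 2, 'york': 1, 'data': 1}
```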
54 k-means on MapReduce Mappers read data portions and centroids Mappers assign data instances to clusters Mappers compute new local centroids and local cluster sizes Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids Reducers write the new centroids Panda et al, Chapter 2
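A sketch of these mapper and reducer roles, again in toy single-process Python (function names are ours; the shuffle loop stands in for the MapReduce runtime):

```python
def km_mapper(portion, centroids):
    """A mapper: assign its data portion to clusters, then emit one
    (cluster id, (local sum, local size)) record per non-empty cluster."""
    local = {}
    for x in portion:
        j = min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
        s, n = local.get(j, ([0.0] * len(x), 0))
        local[j] = ([si + xi for si, xi in zip(s, x)], n + 1)
    return list(local.items())

def km_reducer(key, values):
    """A reducer: aggregate local centroids weighted by local cluster sizes."""
    total = [0.0] * len(values[0][0])
    size = 0
    for s, n in values:
        total = [t + si for t, si in zip(total, s)]
        size += n
    return [t / size for t in total]

portions = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]
centroids = [(0.0, 0.0), (10.0, 10.0)]
grouped = {}                      # the shuffle: same key -> same reducer
for portion in portions:
    for key, value in km_mapper(portion, centroids):
        grouped.setdefault(key, []).append(value)
new_centroids = {key: km_reducer(key, vals) for key, vals in grouped.items()}
print(new_centroids)   # {0: [0.0, 1.0], 1: [10.0, 11.0]}
```

Emitting (sum, count) pairs rather than raw points is what keeps the shuffle traffic proportional to k, not to the data size.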
55 Discussion on MapReduce MapReduce is not designed for iterative processing Mappers read the same data again and again MapReduce looks too low-level to some people Data analysts are traditionally SQL folks MapReduce looks too high-level to others A lot of MapReduce logic is hard to adapt Example: grouping documents by words
56 MapReduce wrappers Many of them are available At different levels of stability Apache Pig is an SQL-like environment Group, Join, Filter rows, Filter columns (Foreach) Developed at Yahoo! Research Olston et al, SIGMOD 2008 DryadLINQ is a C#-like environment Developed at Microsoft Research Yu et al, OSDI 2008
57 k-means in Apache Pig: input data Assume we need to cluster documents, stored in a 3-column table D:

   Document   Word   Count
   doc1       new    2
   doc1       york   2

Initial centroids are k randomly chosen docs, stored in table C in the same format as above
58 k-means in Apache Pig: E-step Assign each document d to the cluster c maximizing sim(d, c) = Σ_w (i_dw · i_cw) / sqrt(Σ_w i_cw²):

D_C = JOIN C BY w, D BY w;
PROD = FOREACH D_C GENERATE d, c, i_d * i_c AS i_d_i_c;
PROD_g = GROUP PROD BY (d, c);
DOT_PROD = FOREACH PROD_g GENERATE d, c, SUM(i_d_i_c) AS dXc;
SQR = FOREACH C GENERATE c, i_c * i_c AS i_c2;
SQR_g = GROUP SQR BY c;
LEN_C = FOREACH SQR_g GENERATE c, SQRT(SUM(i_c2)) AS len_c;
DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c;
SIM = FOREACH DOT_LEN GENERATE d, c, dXc / len_c;
SIM_g = GROUP SIM BY d;
CLUSTERS = FOREACH SIM_g GENERATE TOP(1, 2, SIM);
64 k-means in Apache Pig: M-step

D_C_W = JOIN CLUSTERS BY d, D BY d;
D_C_W_g = GROUP D_C_W BY (c, w);
SUMS = FOREACH D_C_W_g GENERATE c, w, SUM(i_d) AS sum;
D_C_W_gg = GROUP D_C_W BY c;
SIZES = FOREACH D_C_W_gg GENERATE c, COUNT(D_C_W) AS size;
SUMS_SIZES = JOIN SIZES BY c, SUMS BY c;
C = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS i_c;
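For readers who don't use Pig, here is a rough Python transliteration of the E- and M-steps, with D and C as sparse {(id, word): weight} tables. An illustrative sketch only: the names are ours, and cluster size is counted in documents here.

```python
from collections import defaultdict
from math import sqrt

def estep(D, C):
    """For each document d, pick the centroid c maximizing dot(d, c) / |c|."""
    # JOIN on word + GROUP by (d, c): sparse dot products
    by_word = defaultdict(list)
    for (c, w), ic in C.items():
        by_word[w].append((c, ic))
    dot = defaultdict(float)
    for (d, w), idw in D.items():
        for c, ic in by_word.get(w, []):
            dot[(d, c)] += idw * ic
    # centroid lengths
    sq = defaultdict(float)
    for (c, w), ic in C.items():
        sq[c] += ic * ic
    length = {c: sqrt(v) for c, v in sq.items()}
    # TOP(1): best cluster per document by normalized similarity
    best = {}
    for (d, c), dxc in dot.items():
        sim = dxc / length[c]
        if d not in best or sim > best[d][1]:
            best[d] = (c, sim)
    return {d: c for d, (c, sim) in best.items()}

def mstep(D, clusters):
    """New centroid = average word vector of the documents assigned to it."""
    sums, members = defaultdict(float), defaultdict(set)
    for (d, w), idw in D.items():
        sums[(clusters[d], w)] += idw
        members[clusters[d]].add(d)
    return {(c, w): s / len(members[c]) for (c, w), s in sums.items()}

D = {("doc1", "new"): 2.0, ("doc1", "york"): 2.0,
     ("doc2", "big"): 1.0, ("doc2", "data"): 3.0}
C = {("c1", "new"): 1.0, ("c1", "york"): 1.0, ("c2", "data"): 1.0}
clusters = estep(D, C)
print(clusters)   # {'doc1': 'c1', 'doc2': 'c2'}
```

Each Pig statement maps onto one of the loops above; the relational style makes every intermediate a table that the runtime can partition across machines.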
65 MapReduce job setup time In an iterative process, setting up a MapReduce job at each iteration is costly Solution: forward scheduling set up the next job before the previous one completes Panda et al, Chapter 2 (figure: Setup / Process / Tear down stages of consecutive jobs overlapping)
66 k-means in DryadLINQ Budiu et al, Chapter 3

Vector NearestCenter(Vector point, IQueryable<Vector> centers) {
    var nearest = centers.First();
    foreach (var center in centers)
        if ((point - center).Norm() < (point - nearest).Norm())
            nearest = center;
    return nearest;
}

IQueryable<Vector> KMeansStep(IQueryable<Vector> vectors, IQueryable<Vector> centers) {
    return vectors.GroupBy(vector => NearestCenter(vector, centers))
                  .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
}
67 DryadLINQ: k-means execution plan
68 Takeaways on MapReduce wrappers Machine learning in SQL is fairly awkward DryadLINQ looks much more suitable (beta available) Check out Chapter 3 for a Kinect application! Writing high-level code requires deep understanding of low-level processes
69 Parallelization: platform choices (platform table repeated)
70 HPC clusters High Performance Computing clusters / blades / supercomputers Thousands of cores Great variety of architectural choices Disk organization, cache, communication etc. Fault tolerance mechanisms are not crucial Hardware failures are rare Most typical communication protocol: MPI Message Passing Interface Gropp et al, MIT Press 1994
71 Message Passing Interface (MPI) Runtime communication library Available for many programming languages MPI_Bsend(void* buffer, int size, int destid) serialization is on you MPI_Recv(void* buffer, int size, int sourceid) blocks until the message is received MPI_Bcast broadcasts a message MPI_Barrier synchronizes all processes
72 MapReduce vs. MPI MPI is a generic framework Processes send messages to other processes Any computation graph can be built Most suitable for the master/slave model
73 k-means using MPI Slaves read data portions Master broadcasts centroids to slaves Slaves assign data instances to clusters Slaves compute new local centroids and local cluster sizes Then send them to the master Pednault et al, Chapter 4 Master aggregates local centroids weighted by local cluster sizes into new global centroids
74 Two features of MPI parallelization State-preserving processes Processes can live as long as the system runs No need to read the same data again and again All necessary parameters can be preserved locally Hierarchical master/slave paradigm Pednault et al, Chapter 4 A slave can be a master of other processes Could be very useful in dynamic resource allocation When a slave recognizes it has too much stuff to process
75 Takeaways on MPI Old, well established, well debugged Very flexible Perfectly suitable for iterative processing Fault intolerant Not that widely available anymore An open source implementation: OpenMPI MPI can be deployed on Hadoop Ye et al, CIKM 2009
76 Parallelization: platform choices (platform table repeated)
77 Multicore One machine, up to dozens of cores Shared memory, one disk Multithreading as the parallelization scheme Data might not fit in RAM Use streaming to process the data in portions Disk access may be the bottleneck If the data does fit, RAM access is the bottleneck Use uniform, small memory requests Tatikonda & Parthasarathy, Chapter 20
78 Parallelization: platform choices (platform table repeated)
79 Graphics Processing Unit (GPU) GPU has become General-Purpose (GP-GPU) CUDA is a GP-GPU programming framework Powered by NVIDIA Each GPU consists of hundreds of multiprocessors Each multiprocessor consists of a few ALUs ALUs execute the same line of code synchronously When code branches, some multiprocessors stall Avoid branching as much as possible
80 Machine learning with GPUs To fully utilize a GPU, the data needs to fit in RAM This limits the maximal size of the data GPUs are optimized for speed A good choice for real-time tasks A typical usecase: a model is trained offline and then applied in real-time (inference) Machine vision / speech recognition are example domains Coates et al, Chapter 18 Chong et al, Chapter 21
81 k-means clustering on a GPU Cluster membership assignment done on GPU: Centroids are uploaded to every multiprocessor A multiprocessor works on one data vector at a time Each ALU works on one data dimension Hsu et al, Chapter 5 Centroid recalculation is then done on CPU Most appropriate for processing dense data Scattered memory access should be avoided A multiprocessor reads a data vector while its ALUs process a previous vector
82 Performance results 4 million 8-dimensional vectors 400 clusters 50 k-means iterations 9 seconds!!!
83 Parallelization: platform choices (platform table repeated)
84 Field-programmable gate array (FPGA) Highly specialized hardware units Programmable in Hardware Description Language (HDL) Applicable to training and inference Durdanovic et al, Chapter 7 Farabet et al, Chapter 19 Check out Chapter 7 for a hybrid parallelization: multicore (coarse-grained) + FPGA (fine-grained)
85 How to choose a platform Obviously depending on the size of the data A cluster is a better option if the data doesn't fit in RAM Optimizing for speed or for throughput GPUs and FPGAs can reach enormous speeds Training a model / applying a model Training is usually offline
86 Thank You!
87 Parallel Information-Theoretic Co-Clustering Bekkerman & Scholz, Chapter 13
88 Illustration of Distributional Co-Clustering (figures, slides 88-92) A joint distribution over X and Y; e.g. P(x3, y8) = 1/30, with marginals P(x3) and P(y8) X is grouped into clusters x̃1, x̃2, x̃3 and Y into ỹ1, ỹ2, ỹ3, inducing e.g. P(Ỹ | x̃1) = (.25, .25, .5) and P(x̃2, ỹ2) = .1 Mutual Information: I(X̃; Ỹ) = Σ_{x̃ ∈ X̃} Σ_{ỹ ∈ Ỹ} p(x̃, ỹ) log [ p(x̃, ỹ) / (p(x̃) p(ỹ)) ]
93 Information-Theoretic Co-Clustering Construct clusterings X̃ and Ỹ of X and Y by optimizing Mutual Information: argmax_{X̃, Ỹ} I(X̃; Ỹ) = argmax_{X̃, Ỹ} Σ_{x̃ ∈ X̃} Σ_{ỹ ∈ Ỹ} p(x̃, ỹ) log [ p(x̃, ỹ) / (p(x̃) p(ỹ)) ]
94 Two optimization strategies Both strategies: Randomly initialize clusterings of X and Y Alternate reclustering wrt X̃ and wrt Ỹ Strategy 1: Centroid-based (Dhillon et al, KDD 2003) At iteration t, assign each x to the cluster x̃ with argmin_{x̃} D_KL( P(Y|x) ‖ P^(t)(Ỹ | x̃) ) Compute a centroid P^(t+1)(Ỹ | x̃) for each new cluster x̃
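Strategy 1 in code: a minimal sketch of the KL-based reassignment (the conditionals and centroids below are hypothetical, and zero-probability entries in the centroids are assumed away for brevity):

```python
from math import log2

def kl(p, q):
    """D_KL(p || q) for distributions as aligned lists; assumes q[i] > 0 where p[i] > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def centroid_assign(row_dists, centroids):
    """Centroid-based IT-CC update: each x goes to argmin_c D_KL(P(Y|x) || P(Y|c))."""
    return [min(range(len(centroids)), key=lambda c: kl(p, centroids[c]))
            for p in row_dists]

# Hypothetical conditionals P(Y|x) for three rows, two cluster centroids P(Y|x~)
rows = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
cents = [[0.75, 0.15, 0.1], [0.1, 0.15, 0.75]]
print(centroid_assign(rows, cents))   # [0, 0, 1]
```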
95 Sequential Co-Clustering For each x: Remove x from its original cluster For each cluster x̃: compute the delta in the Mutual Information if x is assigned to x̃ Assign x to the cluster for which the delta is maximal
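A naive implementation of the sequential pass (for clarity it recomputes the full Mutual Information for every candidate move, which is equivalent to comparing deltas; real implementations compute only the delta):

```python
from math import log2

def mi(joint):
    """Mutual Information of a joint given as a 2-D list of probabilities."""
    rows = [sum(r) for r in joint]
    cols = [sum(c) for c in zip(*joint)]
    return sum(p * log2(p / (rows[i] * cols[j]))
               for i, r in enumerate(joint) for j, p in enumerate(r) if p > 0)

def cluster_joint(P, xclus, k):
    """P(x~, y) from the instance-level joint P(x, y) and a row-cluster assignment."""
    Q = [[0.0] * len(P[0]) for _ in range(k)]
    for i, row in enumerate(P):
        for j, p in enumerate(row):
            Q[xclus[i]][j] += p
    return Q

def sequential_pass(P, xclus, k):
    """For each x: try every cluster, keep the move with the largest MI."""
    for i in range(len(P)):
        xclus[i] = max(range(k),
                       key=lambda c: mi(cluster_joint(P, xclus[:i] + [c] + xclus[i + 1:], k)))
    return xclus

P = [[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]]
labels = sequential_pass(P, [0, 1, 0, 1], 2)
print(labels)   # [1, 1, 0, 0] -- rows with identical column profiles end up together
```

Because each move takes effect immediately, later instances see the updated clustering, which is what makes the strategy sequential (and, per the next slides, harder to parallelize).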
96 Sequential vs. centroid-based updates
98 Theoretical results in a nutshell The centroid-based algorithm misses updates Sequential CC updates more aggressively & faster Theorem: Sequential CC has a true subset of local optima compared to centroid-based IT-CC
99 Results on small data sets (table comparing centroid vs. sequential scores on datasets acheyer, mgondek, sanders-r, NG) Does the sequential strategy work better on large sets?
100 From inherently sequential to parallel Objective function: I(X̃; Ỹ) = Σ_{x̃ ∈ X̃} Σ_{ỹ ∈ Ỹ} p(x̃, ỹ) log [ p(x̃, ỹ) / (p(x̃) p(ỹ)) ] For each x̃, the inner sum decomposes cluster by cluster: p(x̃, ỹ1) log [ p(x̃, ỹ1) / (p(x̃) p(ỹ1)) ] + ... + p(x̃, ỹ6) log [ p(x̃, ỹ6) / (p(x̃) p(ỹ6)) ], so the contributions of disjoint cluster pairs can be computed independently
101 Parallel sequential co-clustering Initialize clusters at random Split clusters to pairs Assign each pair to one machine Shuffle clusters in parallel Try moving each instance from one cluster to another Assign a different pair of clusters to each machine Repeat to cover all cluster pairs
102 How can we make sure that each cluster pair is generated exactly once? With minimal communication costs?
103 Tournament
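One standard answer to the question above is the round-robin "circle method" from tournament scheduling, which is presumably what the Tournament figures illustrate. A sketch (assumes an even number of clusters):

```python
def tournament_rounds(k):
    """Round-robin (circle method): k clusters, k-1 rounds of k/2 disjoint pairs.
    Every cluster pair is generated exactly once, so each machine can shuffle
    one cluster pair per round with minimal coordination. Assumes k is even."""
    ids = list(range(k))
    rounds = []
    for _ in range(k - 1):
        rounds.append([(ids[i], ids[k - 1 - i]) for i in range(k // 2)])
        ids = [ids[0]] + [ids[-1]] + ids[1:-1]   # rotate all positions but the first
    return rounds

rounds = tournament_rounds(6)
print(len(rounds))                       # 5 rounds for 6 clusters
all_pairs = {frozenset(p) for r in rounds for p in r}
print(len(all_pairs))                    # 15 = C(6,2): every pair exactly once
```

Within a round the pairs are disjoint, so the k/2 machines can shuffle their pairs simultaneously; between rounds each machine hands off one cluster to a neighbor.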
110 DataLoom: Experimental Setup Parallelization of sequential co-clustering MPI-based implementation Centroid-based IT-CC: Implementation in MapReduce Double k-means: Co-clustering in Euclidean space
111 Overall costs per full iteration k: #clusters, m: #machines, v: #values in P(Y|X) Communication: centroid-based O(2m·k·|Ỹ|) = O(m·k²) (sending centroids) DataLoom O((k-1)·(v/2)) = O(k·v) (sending clusters) In our experiments (many machines, sparse data): 2m·|Ỹ| ≈ v/2 CPU cycles: O(k·v) for both centroid-based & DataLoom
112 Experimental results RCV1 dataset: 800,000 docs, 150,000 words, 55 2nd-level Reuters categories; 800 document / 800 term clusters; clustering without label information; choose the mode of each cluster Netflix (KDD Cup 07): 18,000 movies, 480,000 users; clustering the binary rating matrix; 800 movie / 800 user clusters; for hold-out users, rank movies by the cluster-induced distribution
113 Conclusion DataLoom: parallelization of sequential co-clustering Theoretical result: sequential updates superior for cluster updates Experimental results: excellent clustering results on two large benchmarks
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Enhancing Cloud-based Servers by GPU/CPU Virtualization Management
Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
Map Reduce / Hadoop / HDFS
Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader
A Performance Evaluation of Open Source Graph Databases Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader Overview Motivation Options Evaluation Results Lessons Learned Moving Forward
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de
Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft
Big Graph Processing: Some Background
Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs
Data Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based
Hadoop Operations Management for Big Data Clusters in Telecommunication Industry
Hadoop Operations Management for Big Data Clusters in Telecommunication Industry N. Kamalraj Asst. Prof., Department of Computer Technology Dr. SNS Rajalakshmi College of Arts and Science Coimbatore-49
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
Spark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
Lecture 10 - Functional programming: Hadoop and MapReduce
Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional
HPC ABDS: The Case for an Integrating Apache Big Data Stack
HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox [email protected] http://www.infomall.org
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
Parallel Computing for Data Science
Parallel Computing for Data Science With Examples in R, C++ and CUDA Norman Matloff University of California, Davis USA (g) CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
Data Mining in the Swamp
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
Extending Hadoop beyond MapReduce
Extending Hadoop beyond MapReduce Mahadev Konar Co-Founder @mahadevkonar (@hortonworks) Page 1 Bio Apache Hadoop since 2006 - committer and PMC member Developed and supported Map Reduce @Yahoo! - Core
Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012
Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Classification On The Clouds Using MapReduce
Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal [email protected] Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal [email protected]
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform Amir H. Payberah Swedish Institute of Computer Science [email protected] June 4, 2014 Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44 Big Data small data
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016
Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
MACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
BIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
BSPCloud: A Hybrid Programming Library for Cloud Computing *
BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China [email protected],
Verification and Validation of MapReduce Program model for Parallel K-Means algorithm on Hadoop Cluster
Verification and Validation of MapReduce Program model for Parallel K-Means algorithm on Hadoop Cluster Amresh Kumar Department of Computer Science & Engineering, Christ University Faculty of Engineering
The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia
The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit
Unsupervised Data Mining (Clustering)
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.
CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensie Computing Uniersity of Florida, CISE Department Prof. Daisy Zhe Wang Map/Reduce: Simplified Data Processing on Large Clusters Parallel/Distributed
Is a Data Scientist the New Quant? Stuart Kozola MathWorks
Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by
Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.
Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult
Hadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
Developing MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
Machine Learning. CUNY Graduate Center, Spring 2013. Professor Liang Huang. [email protected]
Machine Learning CUNY Graduate Center, Spring 2013 Professor Liang Huang [email protected] http://acl.cs.qc.edu/~lhuang/teaching/machine-learning Logistics Lectures M 9:30-11:30 am Room 4419 Personnel
Analytics on Big Data
Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis
Seeking Opportunities for Hardware Acceleration in Big Data Analytics
Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
FPGA-based MapReduce Framework for Machine Learning
FPGA-based MapReduce Framework for Machine Learning Bo WANG 1, Yi SHAN 1, Jing YAN 2, Yu WANG 1, Ningyi XU 2, Huangzhong YANG 1 1 Department of Electronic Engineering Tsinghua University, Beijing, China
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Cloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild.
Parallel Computing: Strategies and Implications Dori Exterman CTO IncrediBuild. In this session we will discuss Multi-threaded vs. Multi-Process Choosing between Multi-Core or Multi- Threaded development
Parallel Computing. Benson Muite. [email protected] http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite [email protected] http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
Large-Scale Test Mining
Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding
Rethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
Data Management in the Cloud
Data Management in the Cloud Ryan Stern [email protected] : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
