Parallel and Large Scale Learning with scikit-learn
1 Parallel and Large Scale Learning with scikit-learn Data Science London Meetup - Mar. 2013
2 About me Regular contributor to scikit-learn Interested in NLP, Computer Vision, Predictive Modeling & ML in general Interested in Cloud Tech and Scaling Stuff Starting my own ML consulting business:
3 Outline The Problem and the Ecosystem Scaling Text Classification Scaling Forest Models Introduction to IPython.parallel & StarCluster Scaling Model Selection & Evaluation
4 Parts of the Ecosystem: Single Machine with Multiple Cores (multiprocessing); Multiple Machines with Multiple Cores
5 The Problem: Big CPU (Supercomputers, MPI): simulating stuff from models. Big Data (Google scale, MapReduce): counting stuff in logs / indexing the Web. Machine Learning? Often somewhere in the middle.
6 Cross Validation [diagram]: the input data and the labels to predict.
7 Cross Validation [diagram]: the dataset is split into three parts A, B, C.
8 Cross Validation [diagram]: two parts form the subset of the data used to train the model; the third is the held-out test set for evaluation.
9 Cross Validation [diagram]: the three folds rotate the held-out part: train on A+B, test on C; train on A+C, test on B; train on B+C, test on A.
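In scikit-learn the three-fold scheme above can be written in a couple of lines; this is a minimal sketch (not taken from the slides) on a toy dataset, and in recent releases the helper lives in sklearn.model_selection (older releases use sklearn.cross_validation):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=3 reproduces the A/B/C scheme: train on two thirds, score on the held-out third
scores = cross_val_score(model, X, y, cv=3)
print(scores, scores.mean())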
10 Model Selection: param_1 in [1, 10, 100], param_2 in [1e3, 1e4, 1e5], the hyperparameters hell. Find the combination of parameters that maximizes the cross-validated score.
11 Grid Search: the grid of candidate combinations, param_1 across {1, 10, 100} and param_2 across {1e3, 1e4, 1e5}: (1, 1e3), (10, 1e3), (100, 1e3), (1, 1e4), (10, 1e4), (100, 1e4), (1, 1e5), (10, 1e5), (100, 1e5).
12 [diagram]: the same nine (param_1, param_2) combinations.
13 Grid Search: Qualitative Results
14 Grid Search: Cross Validated Scores
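A hedged sketch of the same 3x3 grid with GridSearchCV; mapping param_1 and param_2 onto an SVC's C and gamma (and using the digits dataset) is purely illustrative, not what the talk used:

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

param_grid = {
    "C": [1, 10, 100],            # stands in for param_1
    "gamma": [1e-3, 1e-4, 1e-5],  # stands in for param_2
}

# each of the 9 combinations is scored with 3-fold cross validation;
# n_jobs=-1 evaluates the grid cells on all available cores
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)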
15 Parallel ML Use Cases: Stateless Feature Extraction; Model Assessment with Cross Validation; Model Selection with Grid Search; Bagging Models: Random Forests; In-Loop Averaged Models
16 Embarrassingly Parallel ML Use Cases: Stateless Feature Extraction; Model Assessment with Cross Validation; Model Selection with Grid Search; Bagging Models: Random Forests; In-Loop Averaged Models
17 Inter-Process Comm. Use Cases: Stateless Feature Extraction; Model Assessment with Cross Validation; Model Selection with Grid Search; Bagging Models: Random Forests; In-Loop Averaged Models
18 Scaling Text Feature Extraction The Hashing Trick
19 (Count|Tfidf)Vectorizer Scalability Issues: builds an in-memory vocabulary mapping text tokens to integer feature indices; a big Python dict is slow to (un)pickle; a large corpus has ~10^6 tokens; Vocabulary == Statefulness == Sync barrier; no easy way to run in parallel.
20
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vec = TfidfVectorizer()
>>> vec.fit(["the cat sat on the mat."])
>>> vec.vocabulary_
{u'cat': 0, u'mat': 1, u'on': 2, u'sat': 3, u'the': 4}
21 The Hashing Trick
Replace the Python dict by a hash function:
>>> from sklearn.utils.murmurhash import *
>>> murmurhash3_bytes_u32('cat', 0) % 10
9L
>>> murmurhash3_bytes_u32('sat', 0) % 10
0L
Does not need any memory storage. Hashing is stateless: can run in parallel!
22
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> vec = HashingVectorizer()
>>> out = vec.transform(["The cat sat on the mat."])
>>> out.shape
(1, )
>>> out.nnz  # number of non-zero elements
5
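Because the hashing is stateless, independently constructed HashingVectorizer instances with the same parameters map a given token to the same column, so partitions vectorized on different workers can simply be stacked. A minimal sketch (not from the slides), with toy partitions:

import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

partition_1 = ["the cat sat on the mat", "the dog ate my homework"]
partition_2 = ["the cat chased the dog"]

# two independent, stateless vectorizers agree on the column of each token
X1 = HashingVectorizer(n_features=2 ** 18).transform(partition_1)
X2 = HashingVectorizer(n_features=2 ** 18).transform(partition_2)

X = sp.vstack([X1, X2])
print(X.shape)  # (3, 262144)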
23 Some Numbers
24 TfidfVectorizer
Loading 20 newsgroups dataset for all categories
11314 documents (training set)
7532 documents (testing set)
Extracting features from the training dataset using a sparse vectorizer
done in ...s at 1.712MB/s
n_samples: 11314, n_features: ...
Extracting features from the test dataset using the same vectorizer
done in ...s at 3.413MB/s
n_samples: 7532, n_features: ...
25 HashingVectorizer
Loading 20 newsgroups dataset for all categories
11314 documents (training set)
7532 documents (testing set)
Extracting features from the training dataset using a sparse vectorizer
done in ...s at 4.176MB/s
n_samples: 11314, n_features: 65536
Extracting features from the test dataset using the same vectorizer
done in ...s at 4.044MB/s
n_samples: 7532, n_features: 65536
26 HashingVectorizer on Amazon Reviews: Music reviews, 216MB XML file, 140MB raw text / 174,180 reviews: 53s. Books reviews, 1.3GB XML file, 900MB raw text / 975,194 reviews: ~6min.
27 Parallel Text Classification
28 HowTo: Parallel Text Classification [diagram]: start from all the text data and all the labels to predict.
29 Partition the Text Data [diagram]: split into (Text Data 1, Labels 1), (Text Data 2, Labels 2), (Text Data 3, Labels 3).
30 Vectorizer in Parallel [diagram]: each partition goes through its own stateless vectorizer (vec), yielding (Vec Data 1, Labels 1), (Vec Data 2, Labels 2), (Vec Data 3, Labels 3).
31 Train Linear Models in Parallel [diagram]: a linear classifier is trained on each vectorized partition, yielding clf_1, clf_2, clf_3.
32 Collect Models and Average clf = ( clf_1 + clf_2 + clf_3 ) / 3
33-36 Averaging Linear Models
>>> clf = clone(clf_1)
>>> clf.coef_ += clf_2.coef_
>>> clf.coef_ += clf_3.coef_
>>> clf.intercept_ += clf_2.intercept_
>>> clf.intercept_ += clf_3.intercept_
>>> clf.coef_ /= 3; clf.intercept_ /= 3
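Putting slides 28 to 36 together, here is a hedged end-to-end sketch: one linear model is trained per partition (the plain loop stands in for parallel workers, since every iteration is independent) and the coefficients are then averaged. The toy partitions and the choice of SGDClassifier are illustrative assumptions, not the talk's exact code:

import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

partitions = [
    (["good film", "great plot", "boring mess", "terrible acting"], [1, 1, 0, 0]),
    (["loved it", "awful script", "wonderful cast", "dreadful pacing"], [1, 0, 1, 0]),
    (["superb", "painful to watch"], [1, 0]),
]

vec = HashingVectorizer(n_features=2 ** 18)  # stateless, shared by all partitions
classifiers = []
for texts, labels in partitions:             # each iteration is embarrassingly parallel
    clf = SGDClassifier(random_state=0)
    clf.fit(vec.transform(texts), labels)
    classifiers.append(clf)

# average the linear models, as on slides 32-36
avg = clone(classifiers[0])
avg.coef_ = np.mean([c.coef_ for c in classifiers], axis=0)
avg.intercept_ = np.mean([c.intercept_ for c in classifiers], axis=0)
avg.classes_ = classifiers[0].classes_
print(avg.predict(vec.transform(["great cast", "boring script"])))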
37 Training Forest Models in Parallel
38 Tricks: try ExtraTreesClassifier instead of RandomForestClassifier: faster to train, and sometimes better generalization too. Both kinds of forest models are naturally embarrassingly parallel.
39 HowTo: Parallel Forests [diagram]: start from all the data and all the labels to predict.
40 Replicate the Dataset [diagram]: each worker gets a full copy of all the data and all the labels.
41 Train Forest Models in Parallel [diagram]: each worker trains its own forest on the replicated data, yielding clf_1, clf_2, clf_3. Seed each model with a different random_state integer!
42 Collect Models and Combine
Forest models naturally do the averaging at prediction time.
clf = ( clf_1 + clf_2 + clf_3 )
>>> clf = clone(clf_1)
>>> clf.estimators_ += clf_2.estimators_
>>> clf.estimators_ += clf_3.estimators_
43 What if my data does not fit in memory?
44 HowTo: Parallel Forests (for large datasets) [diagram]: start from all the data and all the labels to predict.
45 Partition the Dataset [diagram]: split into (Data 1, Labels 1), (Data 2, Labels 2), (Data 3, Labels 3).
46 Train Forest Models in Parallel [diagram]: each worker trains a forest on its own partition, yielding clf_1, clf_2, clf_3.
47 Collect Models and Sum
clf = ( clf_1 + clf_2 + clf_3 )
>>> clf = clone(clf_1)
>>> clf.estimators_ += clf_2.estimators_
>>> clf.estimators_ += clf_3.estimators_
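A hedged sketch of the partition / train / sum scheme for forests; the dataset, the seeds and the use of ExtraTreesClassifier are illustrative assumptions. Note the extra n_estimators bookkeeping after concatenating estimators_, which the slide code omits:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
partitions = [(X[i::3], y[i::3]) for i in range(3)]  # three disjoint partitions

forests = []
for seed, (Xp, yp) in enumerate(partitions):  # each fit is embarrassingly parallel
    clf = ExtraTreesClassifier(n_estimators=100, random_state=seed, n_jobs=-1)
    forests.append(clf.fit(Xp, yp))

combined = forests[0]                          # reuse the first fitted forest as the container
for other in forests[1:]:
    combined.estimators_ += other.estimators_
combined.n_estimators = len(combined.estimators_)  # keep the bookkeeping consistent
print(combined.n_estimators)                   # 300 trees voting at prediction time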
48 Warning: models trained on the partitioned dataset are not exactly equivalent to models trained on the unpartitioned dataset. With a lot of data this does not matter much in practice (Gilles Louppe & Pierre Geurts).
49 Implementing Parallelization with Python
50 Single Machine with Multiple Cores
51 multiprocessing
>>> from multiprocessing import Pool
>>> p = Pool(4)
>>> p.map(type, [1, 2., '3'])
[int, float, str]
>>> r = p.map_async(type, [1, 2., '3'])
>>> r.get()
[int, float, str]
52 multiprocessing Part of the standard lib Simple API Cross-Platform support (even Windows!) Some support for shared memory Support for synchronization (Lock)
53 multiprocessing: limitations No docstrings in the source code! Very tricky to use the shared memory values with NumPy Bad support for KeyboardInterrupt fork without exec on POSIX
54 joblib: transparent disk-caching of the output values and lazy re-evaluation (memoization); easy simple parallel computing; logging and tracing of the execution.
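The disk-caching side of joblib looks like this in practice; a minimal sketch (not from the slides) where joblib.Memory memoizes an expensive, deterministic function to a local cache directory:

from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)

@memory.cache
def expensive_feature_extraction(n):
    # stand-in for a slow, deterministic computation
    return [i ** 2 for i in range(n)]

features = expensive_feature_extraction(1000000)  # computed and cached on disk
features = expensive_feature_extraction(1000000)  # loaded back from the cache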
55 joblib.parallel
>>> from os.path import join
>>> from joblib import Parallel, delayed
>>> Parallel(2)(delayed(join)('/etc', s)
...             for s in 'abc')
['/etc/a', '/etc/b', '/etc/c']
56 Usage in scikit-learn
Cross Validation: cross_val_score(model, X, y, n_jobs=4, cv=3)
Grid Search: GridSearchCV(model, n_jobs=4, cv=3).fit(X, y)
Random Forests: RandomForestClassifier(n_jobs=4).fit(X, y)
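A hedged, runnable version of those three calls on a toy dataset (the helper is named cross_val_score in current releases, and the small parameter grid passed to GridSearchCV is an illustrative assumption):

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_digits(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0)

scores = cross_val_score(model, X, y, n_jobs=4, cv=3)        # CV folds in parallel
search = GridSearchCV(model, {"max_depth": [4, 8, None]},    # grid cells in parallel
                      n_jobs=4, cv=3).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, n_jobs=4).fit(X, y)  # trees in parallel
print(scores.mean(), search.best_params_)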
57 joblib.parallel: shared memory (dev)
>>> from joblib import Parallel, delayed
>>> import numpy as np
>>> Parallel(2, max_nbytes=1e6)(
...     delayed(type)(np.zeros(int(i)))
...     for i in [1e4, 1e6])
[<type 'numpy.ndarray'>, <class 'numpy.core.memmap.memmap'>]
58 Only 3 allocated datasets are shared by all the concurrent workers performing the grid search over (1, 1e3), (10, 1e3), (100, 1e3), (1, 1e4), (10, 1e4), (100, 1e4), (1, 1e5), (10, 1e5), (100, 1e5).
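One way to obtain this behaviour explicitly (a hedged sketch, not the talk's code) is to dump the large arrays to disk once and reload them memory-mapped, so that every worker reads the same pages instead of receiving its own copy:

import numpy as np
from joblib import Parallel, delayed, dump, load

data = np.random.rand(10000, 100)
dump(data, "shared_data.joblib")                     # write once
shared = load("shared_data.joblib", mmap_mode="r")   # workers map the same file

results = Parallel(n_jobs=2)(
    delayed(np.mean)(shared[i::2]) for i in range(2)
)
print(results)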
59 Problems with multiprocessing & joblib: the current implementation uses fork without exec under Unix, which breaks some optimized runtimes (OpenBLAS, Grand Central Dispatch under OSX). Will be fixed in Python 3 at some point...
60 Multiple Machines with Multiple Cores
61 IPython.parallel Parallel Processing Library Interactive Exploratory Shell Multi Core & Distributed
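A minimal IPython.parallel sketch (not the talk's demo code): it assumes a controller and engines are already running (for example via ipcluster start -n 4, or the StarCluster plugin shown later), connects a Client and maps work onto whichever engine is free:

from IPython.parallel import Client  # in later releases: from ipyparallel import Client

def square(x):
    return x * x

client = Client()                    # connect to the running controller
view = client.load_balanced_view()   # dispatch tasks to idle engines

results = view.map_sync(square, range(32))
print(len(client.ids), results[:5])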
62 Working in the Cloud
Launch a cluster of machines in one command:
$ starcluster start mycluster -s 3 \
      --force-spot-master
$ starcluster sshmaster mycluster
Supports Spot Instances provisioning
Ships blas, atlas, numpy, scipy
IPython plugin, Hadoop plugin and more
63
[global]
DEFAULT_TEMPLATE=ip

[key mykey]
KEY_LOCATION=~/.ssh/mykey.rsa

[plugin ipcluster]
SETUP_CLASS = starcluster.plugins.ipcluster.IPCluster
ENABLE_NOTEBOOK = True

[plugin packages]
setup_class = pypackage.pypackagesetup
packages = msgpack-python, scikit-learn

[cluster ip]
KEYNAME = mykey
CLUSTER_USER = ipuser
NODE_IMAGE_ID = ami-999d49f0
NODE_INSTANCE_TYPE = c1.xlarge
DISABLE_QUEUE = True
SPOT_BID = 0.10
PLUGINS = packages, ipcluster
64
$ starcluster start -s 3 --force-spot-master demo_cluster
StarCluster - (v ...)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to [email protected]
>>> Using default cluster template: ip
>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Launching a 3-node cluster...
>>> Launching master node (ami: ami-999d49f0, type: c1.xlarge)... SpotInstanceRequest:sir-d10e3412
>>> Launching node001 (ami: ami-999d49f0, type: c1.xlarge) SpotInstanceRequest:sir-3cad4812
>>> Launching node002 (ami: ami-999d49f0, type: c1.xlarge) SpotInstanceRequest:sir-1a...
>>> Waiting for cluster to come up... (updating every 5s)
>>> Waiting for open spot requests to become active... 3/3 100%
>>> Waiting for all nodes to be in a 'running' state... 3/3 100%
>>> Waiting for SSH to come up on all nodes... 3/3 100%
>>> Waiting for cluster to come up took ... mins
>>> The master node is ec....compute-1.amazonaws.com
65
>>> Configuring cluster...
>>> Running plugin starcluster.clustersetup.DefaultClusterSetup
>>> Configuring hostnames... 3/3 100%
>>> Creating cluster user: ipuser (uid: 1001, gid: 1001) 3/3 100%
>>> Configuring scratch space for user(s): ipuser 3/3 100%
>>> Configuring /etc/hosts on each node 3/3 100%
>>> Starting NFS server on master
>>> Configuring NFS exports path(s): /home
>>> Mounting all NFS export path(s) on 2 worker node(s) 2/2 100%
>>> Setting up NFS took ... mins
>>> Configuring passwordless ssh for root
>>> Configuring passwordless ssh for ipuser
>>> Running plugin ippackages
>>> Installing Python packages on all nodes:
>>> $ pip install -U msgpack-python
>>> $ pip install -U scikit-learn
>>> Installing 2 python packages took 1.12 mins
66
>>> Running plugin ipcluster
>>> Writing IPython cluster config files
>>> Starting the IPython controller and 7 engines on master
>>> Waiting for JSON connector file...
/Users/ogrisel/.starcluster/ipcluster/SecurityGroup:@sc-demo_cluster-us-east-1.json 100% Time: 00:00:... ...B/s
>>> Authorizing tcp ports [...] on 0.0.0.0/0 for: IPython controller
>>> Adding 16 engines on 2 nodes 2/2 100%
>>> Setting up IPython web notebook for user: ipuser
>>> Creating SSL certificate for user ipuser
>>> Authorizing tcp ports [...] on 0.0.0.0/0 for: notebook
>>> IPython notebook URL: ...
>>> The notebook password is: zyhomhea8rtjscxj
*** WARNING - Please check your local firewall settings if you're having
*** WARNING - issues connecting to the IPython notebook
>>> IPCluster has been started on SecurityGroup:@sc-demo_cluster for user 'ipuser' with 23 engines on 3 nodes.
To connect to cluster from your local machine use:
from IPython.parallel import Client
client = Client('/Users/ogrisel/.starcluster/ipcluster/SecurityGroup:@sc-demo_cluster-us-east-1.json',
                sshkey='/Users/ogrisel/.ssh/mykey.rsa')
See the IPCluster plugin doc for usage details: ...
>>> IPCluster took ... mins
>>> Configuring cluster took ... mins
>>> Starting cluster took ... mins
67 Demo!
68 Perspectives
69 2012 results by Stanford / Google
70 The YouTube Neuron
71 Thanks
72 If we had more time...
73 MapReduce? [diagram]: mappers emit (key, value) pairs such as (k1, v1), (k2, v2), ..., (k6, v6); the pairs are grouped by key and dispatched to reducers.
74 Why MapReduce does not always work: writes a lot of stuff to disk for failover; inefficient for small to medium problems; the flow is [(k, v)] -> mapper -> [(k, v)] -> reducer -> [(k, v)], so how do data and model params fit as (k, v) pairs?; complex to leverage for iterative algorithms.
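To make the (k, v) flow concrete, here is a tiny pure-Python word count written in the map / shuffle / reduce style; it is only an illustration and is not tied to any particular framework:

from collections import defaultdict

documents = ["the cat sat on the mat", "the dog sat"]

# map: emit a (word, 1) pair for every word
mapped = [(word, 1) for doc in documents for word in doc.split()]

# shuffle: group the values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# reduce: aggregate each group into a single (key, value) pair
counts = {key: sum(values) for key, values in groups.items()}
print(counts)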
75 When MapReduce is useful for ML Data Preprocessing & Feature Extraction Parsing, Filtering, Cleaning Computing big JOINs & Aggregates Random Sampling Computing ensembles on partitions
76 The AllReduce Pattern Compute an aggregate (average) of active node data Do not clog a single node with incoming data transfer Traditionally implemented in MPI systems
77 AllReduce 0/3 Initial State [diagram]: six nodes hold the values 1.0, 2.0, 0.5, 1.1, 3.2 and 0.9.
78 AllReduce 1/3 Spanning Tree [diagram]: the nodes are organized into a spanning tree rooted at the 1.0 node.
79-81 AllReduce 2/3 Upward Averages [diagram]: each node sends a partial (average, count) pair up the tree; the leaves report (1.1, 1), (3.1, 1) and (0.9, 1), the intermediate nodes aggregate to (2.1, 3) and (0.7, 2), and the root computes the global average (1.38, 6).
82-84 AllReduce 3/3 Downward Updates [diagram]: the root pushes the global average back down the tree until every node holds 1.38.
85 AllReduce Final State [diagram]: all six nodes hold the value 1.38.
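A toy, single-process illustration of the tree-based averaging the diagrams walk through; the tree shape and node values mirror the slides, while a real implementation would exchange messages between machines (e.g. over IPython.parallel or MPI):

# upward pass: each node reports the (average, count) of its subtree;
# downward pass: the root's global average is pushed back to every node
tree = {"value": 1.0, "children": [
    {"value": 2.0, "children": [{"value": 1.1, "children": []},
                                {"value": 3.2, "children": []}]},
    {"value": 0.5, "children": [{"value": 0.9, "children": []}]},
]}

def upward(node):
    total, count = node["value"], 1
    for child in node["children"]:
        child_avg, child_count = upward(child)
        total += child_avg * child_count
        count += child_count
    return total / count, count

def downward(node, global_avg):
    node["value"] = global_avg
    for child in node["children"]:
        downward(child, global_avg)

global_avg, n = upward(tree)
downward(tree, global_avg)
print(round(global_avg, 2), n)  # average of the six node values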
86 AllReduce Implementations IPC directly w/ IPython.parallel master/docs/examples/parallel/interengine
87 Killall IPython engines on StarCluster
[plugin ipcluster]
SETUP_CLASS = starcluster.plugins.ipcluster.IPCluster
ENABLE_NOTEBOOK = True
NOTEBOOK_DIRECTORY = notebooks

[plugin ipclusterrestart]
SETUP_CLASS = starcluster.plugins.ipcluster.IPClusterRestartEngines
88
$ starcluster runplugin ipclusterrestart demo_cluster
StarCluster - (v ...)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to [email protected]
>>> Running plugin ipclusterrestart
>>> Restarting 23 engines on 3 nodes 3/3 100%