Deep Learning with H2O. Arno Candel & Viraj Parmar
Deep Learning with H2O
Arno Candel & Viraj Parmar
Published by H2O, 2015
February 2015: Second Edition
Contents

1 What is H2O?
2 Introduction
2.1 Installation
2.2 Support
2.3 Deep Learning Overview
3 H2O's Deep Learning Architecture
3.1 Summary of features
3.2 Training protocol
3.2.1 Initialization
3.2.2 Activation and loss functions
3.2.3 Parallel distributed network training
3.2.4 Specifying the number of training samples per iteration
3.3 Regularization
3.4 Advanced optimization
3.4.1 Momentum training
3.4.2 Rate annealing
3.4.3 Adaptive learning
3.5 Loading data
3.5.1 Standardization
3.6 Additional parameters
4 Use Case: MNIST Digit Classification
4.1 MNIST overview
4.2 Performing a trial run
4.2.1 Extracting and handling the results
4.3 Web interface
4.3.1 Variable importances
4.3.2 Java model
4.4 Grid search for model comparison
4.5 Checkpoint model
4.6 Achieving world-record performance
5 Deep Autoencoders
5.1 Nonlinear dimensionality reduction
5.2 Use case: anomaly detection
Appendix A: Complete parameter list
Appendix B: References

Deep Learning with H2O, by Arno Candel & Viraj Parmar
Published by H2O.ai, Inc., Leghorn Street, Mountain View, CA
© H2O.ai, Inc. All Rights Reserved.
February 2015: Second Edition
Cover illustration inspired by Momoko Sudo. Photos copyright H2O.ai, Inc.
While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
Printed in the United States of America.
1 What is H2O?

H2O is fast, scalable, open-source machine learning and deep learning for smarter applications. With H2O, enterprises like PayPal, Nielsen, Cisco and others can use all of their data without sampling and get accurate predictions faster. Advanced algorithms, like Deep Learning, Boosting and Bagging Ensembles, are readily available for application designers to build smarter applications through elegant APIs. Some of our earliest customers have built powerful domain-specific predictive engines for Recommendations, Customer Churn, Propensity to Buy, Dynamic Pricing and Fraud Detection in the Insurance, Healthcare, Telecommunications, AdTech, Retail and Payment Systems industries.

Using in-memory compression techniques, H2O can handle billions of data rows in-memory, even with a fairly small cluster. The platform includes interfaces for R, Python, Scala, Java, JSON and CoffeeScript/JavaScript, along with its built-in Flow web interface, which makes it easier for non-engineers to stitch together complete analytic workflows. The platform was built alongside (and on top of) both Hadoop and Spark clusters, and is typically deployed within minutes.

H2O implements almost all common machine learning algorithms, such as generalized linear modeling (linear regression, logistic regression, etc.), Naive Bayes, principal components analysis, time series, k-means clustering and others. H2O also implements best-in-class algorithms such as Random Forest, Gradient Boosting and Deep Learning at scale. Customers can build thousands of models and compare them to get the best prediction results.

H2O is nurturing a grassroots movement of physicists, mathematicians, and computer and data scientists to herald the new wave of discovery with data science. Academic researchers and industrial data scientists collaborate closely with our team to make this possible. Stanford University professors Stephen Boyd, Trevor Hastie and Rob Tibshirani advise the H2O team on building scalable machine learning algorithms.
With hundreds of meetups over the past two years, H2O has become a word-of-mouth phenomenon, growing a hundred-fold in the data community; it is now used by 12,000+ users and deployed in corporations using R, Python, Hadoop and Spark.

Try it out: H2O offers an R package that can be installed from CRAN, and H2O itself can be downloaded from the H2O website.

Join the community: Connect with [email protected] to learn about our meetups, training sessions, hackathons, and product updates.

Learn more about H2O: Visit the H2O website.
2 Introduction

Deep Learning has been dominating recent machine learning competitions with better predictions. Unlike the neural networks of the past, modern Deep Learning has cracked the code for training stability and generalization, and it scales to big data. It is the algorithm of choice for the highest predictive accuracy.

This documentation presents the Deep Learning framework in H2O, as experienced through the H2O R interface. Further documentation on H2O's system and algorithms can be found on the H2O website, especially the "R User documentation", and fully featured tutorials are available there as well. The datasets, R code and instructions for this document can be found in the H2O GitHub repository (DeepLearningRVignetteDemo). This introductory section provides instructions on getting H2O started from R, followed by a brief overview of deep learning.

2.1 Installation

To install H2O, follow the "Download" link on H2O's website. For multi-node operation, download the H2O zip file and deploy H2O on your cluster, following the instructions in the "Full Documentation". For single-node operation, follow the instructions in the "Install in R" tab. Open your R console and run the following to install and start H2O directly from R:

# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload = TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download, install and initialize the H2O package for R
# (filling in the *'s with the latest version number obtained from the H2O download page).
install.packages("h2o", repos = c(".../master/****/R", getOption("repos")))

library(h2o)

Initialize H2O with

h2o_server = h2o.init(nthreads = -1)

With this command, the H2O R module will start an instance of H2O automatically at localhost:54321. Alternatively, to specify a connection with an existing H2O cluster node (other than localhost at port 54321), you must explicitly state the IP address and port number in the h2o.init() call. An example is given
below, but do not directly paste; you should specify the IP and port number appropriate to your specific environment.

h2o_server = h2o.init(ip = " ", port = 12345, startH2O = FALSE, nthreads = -1)

An automatic demo is available to see h2o.deeplearning at work. Run the following command to observe an example binary classification model built through H2O's Deep Learning.

demo(h2o.deeplearning)

2.2 Support

Users of the H2O package may submit general inquiries and bug reports to the 0xdata support address. Alternatively, specific bugs or issues may be filed to the 0xdata JIRA.

2.3 Deep Learning Overview

First we present a brief overview of deep neural networks for supervised learning tasks. There are several theoretical frameworks for deep learning, and here we summarize the feedforward architecture used by H2O.

[Figure: a single neuron with inputs x_1, ..., x_n, weights w_1, ..., w_n, and bias b]

The basic unit in the model (shown above) is the neuron, a biologically inspired model of the human neuron. For humans, varying strengths of neurons' output signals travel along the synaptic junctions and are then aggregated as input for a connected neuron's activation. In the model, the weighted combination α = Σ_{i=1}^n w_i x_i + b of input signals is aggregated, and then an output signal f(α) is transmitted by the connected neuron. The function f represents the nonlinear activation function used throughout the network, and the bias b accounts for the neuron's activation threshold.

Multi-layer, feedforward neural networks consist of many layers of interconnected neuron units: beginning with an input layer to match the feature space; followed by multiple layers of nonlinearity; and terminating with a linear regression or classification layer to match the output space. The inputs and outputs of the model's units follow the basic logic of the single neuron described above. Bias units are included in each non-output layer of the network.
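The single-neuron computation above can be sketched in a few lines of R (a toy illustration with made-up inputs and weights, not H2O's internal implementation):

```r
# A single neuron: weighted combination of inputs plus bias,
# passed through a nonlinear activation f.
neuron_output <- function(x, w, b, f = tanh) {
  alpha <- sum(w * x) + b   # alpha = sum_i w_i * x_i + b
  f(alpha)                  # output signal f(alpha)
}

x <- c(0.5, -1.2, 0.3)   # input signals
w <- c(0.4, 0.1, -0.7)   # synaptic weights
b <- 0.2                 # bias (activation threshold)
neuron_output(x, w, b)
```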
The weights linking neurons, together with the biases, fully determine the output of the entire network, and learning occurs when these weights are adapted to minimize the error on labeled training data. More specifically, for each training example j the objective is to minimize a loss function L(W,B | j). Here W is the collection {W_i}_{1:N-1}, where W_i denotes the weight matrix connecting layers i and i+1 for a network of N layers; similarly, B is the collection {b_i}_{1:N-1}, where b_i denotes the column vector of biases for layer i+1.

This basic framework of multi-layer neural networks can be used to accomplish deep learning tasks. Deep learning architectures are models of hierarchical feature extraction, typically involving multiple levels of nonlinearity. Such models are able to learn useful representations of raw data, and have exhibited high performance on complex data such as images, speech, and text (Bengio, 2009).
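To make the W_i, b_i notation concrete, here is a minimal R sketch of the forward pass through a small 2-3-1 network (hypothetical weights chosen for illustration; tanh hidden units with a linear output layer):

```r
# Forward pass: W[[i]] connects layer i to layer i+1,
# b[[i]] holds the biases (column vector) of layer i+1.
forward <- function(x, W, b, f = tanh) {
  a <- x
  for (i in seq_along(W)) {
    z <- W[[i]] %*% a + b[[i]]
    a <- if (i < length(W)) f(z) else z   # linear output layer
  }
  as.vector(a)
}

W <- list(matrix(c(0.1, -0.3, 0.2, 0.4, 0.0, 0.5), nrow = 3),  # 3x2: input -> hidden
          matrix(c(0.7, -0.2, 0.1), nrow = 1))                 # 1x3: hidden -> output
b <- list(c(0.1, 0.0, -0.1), 0.05)
forward(c(1, 2), W, b)
```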
3 H2O's Deep Learning Architecture

As described above, H2O follows the model of multi-layer, feedforward neural networks for predictive modeling. This section provides a more detailed description of H2O's Deep Learning features, parameter configurations, and computational implementation.

3.1 Summary of features

H2O's Deep Learning functionalities include:

- purely supervised training protocol for regression and classification tasks
- fast and memory-efficient Java implementations based on columnar compression and fine-grain Map/Reduce
- multi-threaded and distributed parallel computation, to be run on either a single node or a multi-node cluster
- fully automatic per-neuron adaptive learning rate for fast convergence
- optional specification of learning rate, annealing and momentum options
- regularization options including L1, L2, dropout, Hogwild! and model averaging to prevent model overfitting
- elegant web interface or fully scriptable R API from the H2O CRAN package
- grid search for hyperparameter optimization and model selection
- model checkpointing for reduced run times and model tuning
- automatic data pre- and post-processing for categorical and numerical data
- automatic imputation of missing values
- automatic tuning of communication vs. computation for best performance
- model export in plain Java code for deployment in production environments
- additional expert parameters for model tuning
- deep autoencoders for unsupervised feature learning and anomaly detection capabilities

3.2 Training protocol

The training protocol described below follows many of the ideas and advances in the recent deep learning literature.

3.2.1 Initialization

Various deep learning architectures employ a combination of unsupervised pretraining followed by supervised training, but H2O uses a purely supervised training protocol.
The default initialization scheme is the uniform adaptive option, which is an optimized initialization based on the size of the network. Alternatively, you may select a random initialization drawn from either a uniform or a normal distribution, for which a scaling parameter may be specified as well.

3.2.2 Activation and loss functions

In the introduction we introduced the nonlinear activation function f, for which the choices are summarized in Table 1. Note here that x_i and w_i denote the firing neuron's input values and their weights, respectively; α denotes the weighted combination α = Σ_{i=1}^n w_i x_i + b.

Table 1: Activation Functions

  Function          Formula                                           Range
  Tanh              f(α) = (e^α - e^(-α)) / (e^α + e^(-α))            f(·) ∈ [-1, 1]
  Rectified Linear  f(α) = max(0, α)                                  f(·) ∈ ℝ₊
  Maxout            f(α) = max(w_i x_i + b), rescale if max f(·) ≥ 1  f(·) ∈ [-∞, 1]

The tanh function is a rescaled and shifted logistic function, and its symmetry around 0 allows the training algorithm to converge faster. The rectified linear activation function has demonstrated high performance on image recognition tasks, and is a more biologically accurate model of neuron activations (LeCun et al, 1998). Maxout activation works particularly well with dropout, a regularization method discussed later in this vignette (Goodfellow et al, 2013). It is difficult to determine a "best" activation function to use; each may outperform the others in separate scenarios, but grid search models (also described later) can help to compare activation functions and other parameters. The default activation function is the Rectifier. Each of these activation functions can be operated with dropout regularization (see below).
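The formulas in Table 1 translate directly into R; a quick sanity check of the tanh and rectified linear activations (toy values, not H2O code):

```r
# Tanh and rectified linear activations from Table 1.
tanh_act  <- function(alpha) (exp(alpha) - exp(-alpha)) / (exp(alpha) + exp(-alpha))
rectifier <- function(alpha) pmax(0, alpha)

alpha <- c(-2, 0, 2)
tanh_act(alpha)    # matches R's built-in tanh(), range [-1, 1]
rectifier(alpha)   # negative inputs are clipped to 0
```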
The choices for the loss function L(W,B | j) are summarized in Table 2. The system default enforces the table's typical-use rule, based on whether regression or classification is being performed. Note here that t^(j) and o^(j) are the predicted (target) output and actual output, respectively, for training example j; further, let y denote the output units and O the output layer.

Table 2: Loss Functions

  Function            Formula                                                                         Typical Use
  Mean Squared Error  L(W,B | j) = (1/2) ||t^(j) - o^(j)||₂²                                          Regression
  Cross Entropy       L(W,B | j) = -Σ_{y ∈ O} [ ln(o_y^(j)) t_y^(j) + ln(1 - o_y^(j)) (1 - t_y^(j)) ] Classification

3.2.3 Parallel distributed network training

The procedure to minimize the loss function L(W,B | j) is a parallelized version of stochastic gradient descent (SGD). Standard SGD can be summarized as follows, with the gradient ∇L(W,B | j) computed via backpropagation (LeCun et al, 1998). The constant α indicates the learning rate, which controls the step sizes during gradient descent.

Standard Stochastic Gradient Descent

  Initialize W, B
  Iterate until convergence criterion reached:
    Get training example i
    Update all weights w_jk ∈ W, biases b_jk ∈ B:
      w_jk := w_jk - α ∂L(W,B | j)/∂w_jk
      b_jk := b_jk - α ∂L(W,B | j)/∂b_jk

Stochastic gradient descent is known to be fast and memory-efficient, but not easily parallelizable without becoming slow. We utilize Hogwild!, the recently developed lock-free parallelization scheme from Niu et al. Hogwild! follows a shared-memory model in which multiple cores, each handling separate subsets (or all) of the training data, are able to make independent contributions to the gradient updates ∇L(W,B | j) asynchronously. In a multi-node system, this parallelization scheme works on top of H2O's distributed setup, where the training data is distributed across the cluster. Each node operates in parallel on its local data until the final parameters W, B are obtained by averaging. Below is a rough summary.
Parallel distributed and multi-threaded training with SGD in H2O Deep Learning

  Initialize global model parameters W, B
  Distribute training data T across nodes (can be disjoint or replicated)
  Iterate until convergence criterion reached:
    For nodes n with training subset T_n, do in parallel:
      Obtain copy of the global model parameters W_n, B_n
      Select active subset T_na ⊂ T_n (user-given number of samples per iteration)
      Partition T_na into T_nac by cores n_c
      For cores n_c on node n, do in parallel:
        Get training example i ∈ T_nac
        Update all weights w_jk ∈ W_n, biases b_jk ∈ B_n:
          w_jk := w_jk - α ∂L(W,B | j)/∂w_jk
          b_jk := b_jk - α ∂L(W,B | j)/∂b_jk
    Set W, B := Avg_n W_n, Avg_n B_n
    Optionally score the model on (potentially sampled) train/validation scoring sets

Here, the weight and bias updates follow the asynchronous Hogwild! procedure to incrementally adjust each node's parameters W_n, B_n after seeing example i. The Avg_n notation refers to the final averaging of these local parameters across all nodes to obtain the global model parameters and complete training.

3.2.4 Specifying the number of training samples per iteration

H2O Deep Learning is scalable and can take advantage of a large cluster of compute nodes. There are three modes in which to operate. The default behavior is to let every node train on the entire (replicated) dataset, but automatically shuffling locally (and/or using a subset of) the training examples for each iteration. For datasets that don't fit into each node's memory (also depending on the heap memory specified by the -Xmx option), it might not be possible to replicate the
data, and each compute node can instead be instructed to train only with local data. An experimental single-node mode is available for the case where slow final convergence is observed due to the presence of too many nodes, but we've never seen this become necessary.

The number of training examples (globally) presented to the distributed SGD worker nodes between model averaging is controlled by the important parameter train_samples_per_iteration. One special value is -1, which results in all nodes processing all their local training data per iteration. Note that if replicate_training_data is enabled (true by default), this will result in training N epochs (passes over the data) per iteration on N nodes; otherwise, one epoch will be trained per iteration. Another special value is 0, which always results in one epoch per iteration, independent of the number of compute nodes. In general, any positive number is permissible for this parameter. For large datasets, it might make sense to specify a fraction of the dataset. For example, if the training data contains 10 million rows, and we specify the number of training samples per iteration as 100,000 when running on 4 nodes, then each node will process 25,000 examples per iteration, and it will take 100 such distributed iterations to process one epoch.

If the value is set too high, it might take too long between synchronizations, and model convergence can be slow. If the value is set too low, network communication overhead will dominate the runtime, and computational performance will suffer. The special value of -2 (the default) enables auto-tuning of this parameter, based on the computational performance of the processors and the network of the system, and attempts to find a good balance between computation and communication. Note that this parameter can affect the convergence rate during training.
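The bookkeeping in the example above can be verified with a few lines of R (numbers taken from the text):

```r
# train_samples_per_iteration bookkeeping for the example above.
rows     <- 10e6    # training rows
per_iter <- 100e3   # training samples per iteration (global)
nodes    <- 4

per_node_per_iter <- per_iter / nodes  # examples each node processes per iteration
iters_per_epoch   <- rows / per_iter   # distributed iterations to cover one epoch
per_node_per_iter
iters_per_epoch
```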
3.3 Regularization

H2O's Deep Learning framework supports regularization techniques to prevent overfitting. L1 and L2 regularization enforce the same penalties as they do with other models: they modify the loss function so as to minimize

  L'(W,B | j) = L(W,B | j) + λ₁ R₁(W,B | j) + λ₂ R₂(W,B | j)

For L1 regularization, R₁(W,B | j) is the sum of the L1 norms of all the weights and biases in the network; R₂(W,B | j) is the sum of squares of all the weights and biases in the network. The constants λ₁ and λ₂ are generally chosen to be very small.

The second type of regularization available for deep learning is a recent innovation called dropout (Hinton et al., 2012). Dropout constrains the online optimization such that, during forward propagation for a given training example, each neuron in the network suppresses its activation with probability P, generally taken to be less than 0.2 for input neurons and up to 0.5 for hidden neurons. The effect is twofold: as with L2 regularization, the network weight values are scaled toward 0; furthermore, each training example trains a different model, albeit one sharing the same global parameters. Thus dropout allows an exponentially large number of models to be averaged as an ensemble, which can prevent overfitting and improve generalization. Note that input dropout can be especially useful when the feature space is large and noisy.

3.4 Advanced optimization

H2O features manual and automatic versions of advanced optimization. The manual mode features include momentum training and learning rate annealing, while the automatic mode features an adaptive learning rate.

3.4.1 Momentum training

Momentum modifies back-propagation by allowing prior iterations to influence the current update. In particular, a velocity vector v is defined to modify the updates as follows, with θ representing the parameters W, B; μ representing the momentum coefficient; and α denoting the learning rate.
  v_{t+1} = μ v_t - α ∇L(θ_t)
  θ_{t+1} = θ_t + v_{t+1}

Using the momentum parameter can aid in avoiding local minima and the associated instability (Sutskever et al, 2014). Too much momentum can lead to instabilities, which is why the momentum is best ramped up slowly.
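The velocity update above can be illustrated with a toy R example minimizing f(θ) = θ² (arbitrary coefficients chosen for illustration, not H2O defaults):

```r
# Momentum training on f(theta) = theta^2:
#   v     <- mu * v - alpha * grad(theta)
#   theta <- theta + v
grad  <- function(theta) 2 * theta   # gradient of f
theta <- 5
v     <- 0
mu    <- 0.9    # momentum coefficient
alpha <- 0.05   # learning rate
for (t in 1:200) {
  v     <- mu * v - alpha * grad(theta)
  theta <- theta + v
}
theta   # close to the minimum at 0
```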
A recommended improvement when using momentum updates is the Nesterov accelerated gradient method. Under this method, the updates are further modified such that

  v_{t+1} = μ v_t - α ∇L(θ_t + μ v_t)
  W_{t+1} = W_t + v_{t+1}

3.4.2 Rate annealing

Throughout training, as the model approaches a minimum, the chance of oscillation or "optimum skipping" creates the need for a slower learning rate. Instead of specifying a constant learning rate α, learning rate annealing gradually reduces the learning rate α_t to "freeze" into local minima in the optimization landscape (Zeiler, 2012). For H2O, the annealing rate is the inverse of the number of training samples it takes to cut the learning rate in half (e.g., 10⁻⁶ means that it takes 10⁶ training samples to halve the learning rate).

3.4.3 Adaptive learning

The implemented adaptive learning rate algorithm, ADADELTA (Zeiler, 2012), automatically combines the benefits of learning rate annealing and momentum training to avoid slow convergence. Specifying only the two parameters ρ and ϵ simplifies hyperparameter search. In some cases, manually controlled (non-adaptive) learning rate and momentum specifications can lead to better results, but they require a hyperparameter search over up to 7 parameters. If the model is built on a topology with many local minima or long plateaus, it is possible for a constant learning rate to produce sub-optimal results. In general, however, we find the adaptive learning rate to produce the best results, so this option is kept as the default.

The first of the two hyperparameters for adaptive learning is ρ. It is similar to momentum and relates to the memory of prior weight updates. Typical values are between 0.9 and 1. The second hyperparameter, ϵ, is similar to learning rate annealing during initial training, and to momentum at later stages, where it allows forward progress.
Typical values for ϵ are very small.

3.5 Loading data

Loading a dataset in R for use with H2O is slightly different from the usual methodology: we must convert our datasets into H2OParsedData objects. As an example, we use a toy weather dataset included in the H2O GitHub repository for the H2O Deep Learning documentation. First download the data to your current working directory (do this henceforth for dataset downloads), and then run the following command in your R console.

weather.hex = h2o.uploadFile(h2o_server, path = "weather.csv", header = TRUE, sep = ",", key = "weather.hex")

To see a quick summary of the data, run the following command.

summary(weather.hex)

3.5.1 Standardization

Along with categorical encoding, H2O preprocesses the data to be standardized for compatibility with the activation functions. Recall Table 1's summary of each activation function's target space. Since in general the activation function does not map onto the full real line, we first standardize our data to be drawn from N(0, 1). Standardizing again after network propagation allows us to compute more precise errors in this standardized space, rather than in the raw feature space.

3.6 Additional parameters

This section has reviewed some background on the various parameter configurations in H2O's Deep Learning architecture. H2O Deep Learning models may seem daunting, since there are dozens of possible parameter arguments when creating models. However, most parameters do not need to be tuned or experimented with; the default settings are safe and recommended. The parameters for which experimentation is possible, and perhaps necessary, have mostly been discussed here, but a couple more deserve mention. There is no single best choice for the number and size of hidden layers or for the number of epochs. Practice building deep learning models with different network topologies and different datasets will lead to intuition for these parameters, but two general rules of thumb apply. First, choose larger network sizes, as they
can perform higher-level feature extraction, and techniques like dropout may train only subsets of the network at once. Second, use more epochs for greater predictive accuracy, but only when you can afford the computational cost. Many example tests can be found in the H2O GitHub repository for pointers on specific values and results for these (and other) parameters. For a full list of H2O Deep Learning model parameters and default values, see Appendix A.

4 Use Case: MNIST Digit Classification

4.1 MNIST overview

The MNIST database is a famous academic dataset used to benchmark classification performance. The data consists of 60,000 training images and 10,000 test images, each a standardized 28x28 pixel greyscale image of a single handwritten digit. You can download the datasets from the H2O GitHub repository for the H2O Deep Learning documentation. Remember to save these .csv files to your working directory. Following the weather data example, we begin by loading these datasets into R as H2OParsedData objects.

train_images.hex = h2o.uploadFile(h2o_server, path = "mnist_train.csv", header = FALSE, sep = ",", key = "train_images.hex")
test_images.hex = h2o.uploadFile(h2o_server, path = "mnist_test.csv", header = FALSE, sep = ",", key = "test_images.hex")

4.2 Performing a trial run

The trial run below is illustrative of the relative simplicity that underlies most H2O Deep Learning model parameter configurations, thanks to the defaults. We use the first 28² = 784 values of each row to represent the full image, and the final value to denote the digit class. As mentioned before, Rectified Linear activation is popular in image processing and has performed well on the MNIST database previously; and dropout has been known to enhance performance on this dataset as well, so we train our model accordingly.
#Train the model for digit classification
mnist_model = h2o.deeplearning(x = 1:784, y = 785,
    data = train_images.hex, activation = "RectifierWithDropout",
    hidden = c(200,200,200), input_dropout_ratio = 0.2,
    l1 = 1e-5, validation = test_images.hex, epochs = 10)

The model is run for only 10 epochs, since it is meant just as a trial run. In this trial run we also specified the validation set to be the test set; another option is to use n-fold validation by specifying, for example, nfolds = 5 instead of validation = test_images.hex.

4.2.1 Extracting and handling the results

We can extract the parameters of our model, examine the scoring process, and make predictions on new data.

#View the specified parameters of your deep learning model
mnist_model@model$params

#Examine the performance of the trained model
mnist_model

The latter command returns the trained model's training and validation errors. The training error value is based on the parameter score_training_samples, which specifies the number of randomly sampled training points to be used for scoring; the default uses 10,000 points. The validation error is based on the parameter score_validation_samples, which controls the same value on the validation set and is set by default to be the entire validation set. In general, choosing more sampled points leads to a better idea of the model's performance on your dataset; setting either of these parameters to 0 automatically uses the entire corresponding dataset for scoring. Either way, you can control the minimum and maximum time spent on scoring with the score_interval and score_duty_cycle parameters.

These scoring parameters also affect the final model when the parameter override_with_best_model is turned on. This override sets the final model after training to be the model which achieved the lowest validation error during training, based on the sampled points used for scoring.
Since the validation set is automatically set to be the training data if no other dataset is specified, either the score_training_samples or score_validation_samples parameter will control the error computation during training and, in turn, the chosen best model.
Once we have a satisfactory model, the h2o.predict() command can be used to compute and store predictions on new data, which can then be used for further tasks in the interactive data science process.

#Perform classification on the test set
prediction = h2o.predict(mnist_model, newdata = test_images.hex)

#Copy predictions from H2O to R
pred = as.data.frame(prediction)

4.3 Web interface

H2O R users have access to a slick web interface that mirrors the model building process in R. After loading data or training a model in R, point your browser to your IP address and port number (e.g., localhost:12345) to launch the web interface. From here you can click on Admin > Jobs to view your specific model details. You can also click on Data > View All to view and keep track of the datasets in current use.

4.3.1 Variable importances

One useful feature is the variable importances option, which can be enabled with the additional argument variable_importances = TRUE. This feature allows us to view the absolute and relative predictive strength of each feature in the prediction task. From R, you can access these strengths with the command mnist_model@model$varimp. You can also view a visualization of the variable importances on the web interface.

4.3.2 Java model

Another important feature of the web interface is the Java (POJO) model, accessible from the Java model button in the top right of a model summary page. This button gives access to Java code which, when called from a main method in a Java program, builds the model at hand. Instructions for downloading and running this Java code are available from the web interface, and example production scoring code is available as well.

4.4 Grid search for model comparison

H2O supports grid search capabilities for model tuning, allowing users to tweak certain parameters and observe changes in model behavior. This is done by specifying sets of values for parameter arguments.
For example, below is an example of a grid search:

#Create a set of network topologies
hidden_layers = list(c(200,200), c(100,300,100), c(500,500,500))
mnist_model_grid = h2o.deeplearning(x = 1:784, y = 785,
    data = train_images.hex, activation = "RectifierWithDropout",
    hidden = hidden_layers, validation = test_images.hex,
    epochs = 1, l1 = c(1e-5, 1e-7), input_dropout_ratio = 0.2)

Here we specified three different network topologies and two different L1 norm weights. This grid search effectively trains six different models, over the possible combinations of these parameters. Of course, sets of other parameters can be specified for a larger space of models. This allows for more subtle insights in the model tuning and selection process, as we inspect and compare our trained models after the grid search process is complete. To decide how and when to choose different parameter configurations in a grid search, see Appendix A for parameter descriptions and possible values.

#Print out all prediction errors and run times of the models
mnist_model_grid
mnist_model_grid@model

#Print out a *short* summary of each of the models (indexed by parameter)
mnist_model_grid@sumtable

#Print out *full* summary of each of the models
all_params = lapply(mnist_model_grid@model, function(x) { x@model$params })
all_params

#Access a particular parameter across all models
l1_params = lapply(mnist_model_grid@model, function(x) { x@model$params$l1 })
l1_params

4.5 Checkpoint model
Checkpoint model keys can be used to start off where you left off, if you want to further train a particular model with more iterations, more data, different data, and so forth. If we felt that our initial model should be trained further, we can use it (or its key) as a checkpoint argument in a new model. In the command below, mnist_model_grid@model[[1]] indicates the highest-performing model from the grid search, which we wish to train further. Note that the training and validation datasets, the response column, etc. have to match for checkpoint restarts.

mnist_checkpoint_model = h2o.deeplearning(x = 1:784, y = 785,
    data = train_images.hex, checkpoint = mnist_model_grid@model[[1]],
    validation = test_images.hex, epochs = 9)

Checkpoint models are also applicable when we wish to reload existing models that were saved to disk in a previous session. For example, we can save and later load the best model from the grid search by running the following commands.

#Specify a model and the file path where it is to be saved
h2o.saveModel(object = mnist_model_grid@model[[1]], name = "/tmp/mymodel", force = TRUE)

4.6 Achieving world-record performance

Without distortions, convolutions, or other advanced image processing techniques, the best-ever published test set error for the MNIST dataset is 0.83%, by Microsoft. After training for 2,000 epochs (which took about 4 hours) on 4 compute nodes, we obtain a 0.87% test set error; after training for 8,000 epochs (about 10 hours) on 10 nodes, we obtain a 0.83% test set error, which matches the current world record, notably achieved using a distributed configuration and a simple one-liner from R. Details can be found in our hands-on tutorial. Test set errors of around 1% are typically achieved within one hour when running on one node. The parallel scalability of H2O for the MNIST dataset on 1 to 63 compute nodes is shown in the figure below.
Parallel Scalability (for 64 epochs on MNIST, with 0.83%" world-record parameters) 100 Training time in minutes Speedup #Alternatively, save the model key in some directory (here we use /tmp) #h2o.savemodel(object = mnist_model_grid@model[[1]], dir = "/tmp", force = TRUE) Later (e.g., after restarting H2O) we can load the saved model by indicating the host and saved model file path. This assumes the saved model was saved with a compatible H2O version (no changes to the H2O model implementation). best_mnist_grid.load = h2o.loadmodel(h2o_server, "/tmp/mymodel") mins H2O Nodes H2O Nodes #Continue training the loaded model best_mnist_grid.continue = h2o.deeplearning(x=1:784, y=785, data=train_ images.hex,checkpoint=best_mnist_grid.load, validation = test_images.hex, epochs=1) (4 cores per node, 1 epoch per node per MapReduce) Additionally, you can also use the command model = h2o.getmodel(h2o_server, key) to retrieve a model from its H2O key. This command is useful, for example, if you have created an H2O model using the web interface and wish to proceed with the modeling process in R.
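The speedup axis in the scalability figure is defined as single-node training time divided by N-node training time. A small illustrative sketch (the timings below are made up, not H2O measurements):

```python
# Speedup and parallel efficiency, as plotted in scalability figures.
def speedup(t_one_node, t_n_nodes):
    return t_one_node / t_n_nodes

def parallel_efficiency(t_one_node, t_n_nodes, n_nodes):
    # 1.0 would mean perfect linear scaling
    return speedup(t_one_node, t_n_nodes) / n_nodes

t1 = 100.0   # minutes on 1 node (hypothetical)
t16 = 10.0   # minutes on 16 nodes (hypothetical)
print(speedup(t1, t16))                  # 10.0
print(parallel_efficiency(t1, t16, 16))  # 0.625
```

Sub-linear efficiency is expected here: each MapReduce pass adds communication for model averaging, so doubling the node count less than halves the training time.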
5 Deep Autoencoders

5.1 Nonlinear dimensionality reduction

So far we have discussed purely supervised deep learning tasks. However, deep learning can also be used for unsupervised feature learning or, more specifically, nonlinear dimensionality reduction (Hinton et al., 2006). Consider the diagram on the following page of a three-layer neural network with one hidden layer. If we treat our input data as labeled with the same input values, then the network is forced to learn the identity via a nonlinear, reduced representation of the original data. Such a model is called a deep autoencoder. These models have been used extensively for unsupervised, layer-wise pretraining of supervised deep learning tasks, but here we consider the autoencoder's application to discovering anomalies in data.

5.2 Use case: anomaly detection

Consider the deep autoencoder model described above. Given enough training data resembling some underlying pattern, the network will train itself to easily learn the identity when confronted with that pattern. However, if an "anomalous" test point that does not match the learned pattern arrives, the autoencoder will likely have a high error in reconstructing it, which indicates anomalous data.

We use this framework to develop an anomaly detection demonstration using a deep autoencoder. The dataset is an ECG time series of heartbeats, and the goal is to determine which heartbeats are outliers. The training data (20 "good" heartbeats) and the test data (the training data with 3 "bad" heartbeats appended, for simplicity) can be downloaded from the H2O GitHub repository for the H2O Deep Learning documentation. Each row represents a single heartbeat. The autoencoder is trained as follows:

    train_ecg.hex = h2o.uploadFile(h2o_server, path="ecg_train.csv",
        header=F, sep=",", key="train_ecg.hex")
    test_ecg.hex = h2o.uploadFile(h2o_server, path="ecg_test.csv",
        header=F, sep=",", key="test_ecg.hex")

    #Train deep autoencoder learning model on "normal" training data, y ignored
    anomaly_model = h2o.deeplearning(x=1:210, y=1, train_ecg.hex,
        activation = "Tanh", classification=F, autoencoder=T,
        hidden = c(50,20,50), l1=1e-4, epochs=100)

    #Compute reconstruction error with the Anomaly detection app
    #(MSE between output layer and input layer)
    recon_error.hex = h2o.anomaly(test_ecg.hex, anomaly_model)

    #Pull reconstruction error data into R and plot to find outliers (last 3 heartbeats)
    recon_error = as.data.frame(recon_error.hex)
    recon_error
    plot.ts(recon_error)

    #Note: Testing = Reconstructing the test dataset
    test_recon.hex = h2o.predict(anomaly_model, test_ecg.hex)
    head(test_recon.hex)
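The quantity h2o.anomaly returns is the per-row mean squared error between each input and its reconstruction. A rough sketch of that computation in plain NumPy (the data and simulated reconstructions below are synthetic, not the ECG set, and no autoencoder is actually trained):

```python
import numpy as np

# Per-row reconstruction MSE: large values flag rows the model
# could not reconstruct, i.e. rows that do not match the learned pattern.
def reconstruction_mse(x, x_hat):
    return np.mean((x - x_hat) ** 2, axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(23, 210))                     # 23 heartbeats, 210 samples each
x_hat = x + rng.normal(scale=0.01, size=x.shape)   # "good" rows reconstruct well
x_hat[-3:] = rng.normal(size=(3, 210))             # last 3 rows reconstruct poorly

errors = reconstruction_mse(x, x_hat)
outliers = sorted(int(i) for i in np.argsort(errors)[-3:])  # three largest errors
print(outliers)  # [20, 21, 22]
```

In the ECG demo the same logic applies: the three appended "bad" heartbeats produce reconstruction errors far above those of the 20 "good" ones, so a simple threshold on the error separates them.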
6 Appendix A: Complete Parameter List

x: A vector containing the names of the predictors in the model. No default.

y: The name of the response variable in the model. No default.

data: An H2OParsedData object containing the training data. No default.

key: The unique hex key assigned to the resulting model. If none is given, a key will automatically be generated.

override_with_best_model: If enabled, override the final model with the best model found during training. Default is true.

classification: A logical value indicating whether the algorithm should conduct classification. Otherwise, regression is performed on a numeric response variable.

nfolds: Number of folds for cross-validation. If the number of folds is more than 1, then validation must remain empty. Default is 0.

validation: An H2OParsedData object indicating the validation dataset used to construct the confusion matrix. If left blank, the default is the training data.

holdout_fraction: (Optional) Fraction of the training data to hold out for validation.

checkpoint: Model checkpoint (either a key or an H2ODeepLearningModel) to resume training with.

activation: The choice of nonlinear, differentiable activation function used throughout the network. Options are Tanh, TanhWithDropout, Rectifier, RectifierWithDropout, Maxout and MaxoutWithDropout; the default is Rectifier. See the section on activation and loss functions for more details.

hidden: The number and size of each hidden layer in the model. For example, if c(100,200,100) is specified, a model with 3 hidden layers will be produced, and the middle hidden layer will have 200 neurons. The default is c(200,200). For grid search, use list(c(10,10), c(20,20)), etc. See section 4.2 for more details.

autoencoder: Default is false. See section 5 for more details.

use_all_factor_levels: Use all factor levels of categorical variables. Otherwise, the first factor level is omitted (without loss of accuracy). Useful for variable importances and auto-enabled for autoencoder.

epochs: The number of passes over the training dataset to be carried out. It is recommended to start with lower values for initial grid searches. The value can be modified during checkpoint restarts and allows continuation of selected models. Default is 10.

train_samples_per_iteration: Default is -1, but performance might depend greatly on this parameter. See the section on specifying the number of training samples per iteration for more details.

seed: The random seed controls sampling and initialization. Reproducible results are only expected with single-threaded operation (i.e., when running on one node, turning off load balancing and providing a small dataset that fits in one chunk). In general, the multi-threaded asynchronous updates to the model parameters will result in (intentional) race conditions and non-reproducible results. Default is a randomly generated number.

adaptive_rate: The default enables this feature (adaptive learning rate). See the section on adaptive learning for more details.

rho: The first of two hyperparameters for the adaptive learning rate (when it is enabled). This parameter is similar to momentum and relates to the memory of prior weight updates. Typical values are between 0.9 and 0.999. Default is 0.99. See the section on adaptive learning for more details.

epsilon: The second of two hyperparameters for the adaptive learning rate (when it is enabled). This parameter is similar to learning rate annealing during initial training and momentum at later stages, where it allows forward progress. Typical values are between 1e-10 and 1e-4. This parameter is only active if the adaptive learning rate is enabled. Default is 1e-6. See the section on adaptive learning for more details.

rate: The learning rate. Higher values lead to less stable models, while lower values lead to slower convergence. Default is 0.005.

rate_annealing: Default is 1e-6 (when adaptive learning is disabled). See the section on rate annealing for more details.

rate_decay: The learning rate decay parameter controls the change of learning rate across layers. Default is 1.0 (when adaptive learning is disabled).

momentum_start: Controls the amount of momentum at the beginning of training (when adaptive learning is disabled). Default is 0. See section 3.4.1 for more details.
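Together with momentum_ramp and momentum_stable (described next), momentum_start defines a linear momentum schedule over training samples. A minimal sketch of that schedule (the start, stable, and ramp values below are illustrative, not H2O defaults):

```python
def momentum_at(n_samples, start=0.5, stable=0.99, ramp=1e6):
    # Momentum rises linearly from `start` to `stable` over `ramp`
    # training samples, then stays constant at `stable`.
    frac = min(1.0, n_samples / ramp)
    return start + (stable - start) * frac

print(momentum_at(0))          # at the very start: 0.5
print(momentum_at(500_000))    # halfway up the ramp: about 0.745
print(momentum_at(2_000_000))  # past the ramp: 0.99
```

Ramping the momentum up avoids instability early in training, when the weights are still far from any minimum, while still allowing the faster progress that high momentum gives later on.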
momentum_ramp: Controls the number of training samples over which momentum increases (assuming momentum_stable is larger than momentum_start). Applies when adaptive learning is disabled. The ramp is measured in training samples. Default is 1e6. See section 3.4.1 for more details.

momentum_stable: Controls the final momentum value, reached after momentum_ramp training samples (when adaptive learning is disabled). The momentum used for training remains the same for training beyond that point. Default is 0. See section 3.4.1 for more details.

nesterov_accelerated_gradient: Default is true (when adaptive learning is disabled). See section 3.4.1 for more details.

input_dropout_ratio: Default is 0. See section 3.3 for more details.

hidden_dropout_ratio: Default is 0. See section 3.3 for more details.

l1: L1 regularization (can add stability and improve generalization; causes many weights to become 0). Default is 0. See section 3.3 for more details.

l2: L2 regularization (can add stability and improve generalization; causes many weights to be small). Default is 0. See section 3.3 for more details.

max_w2: A maximum on the sum of the squared incoming weights into any one neuron. This tuning parameter is especially useful for unbounded activation functions such as Maxout or Rectifier. The default leaves this maximum unbounded.

initial_weight_distribution: The distribution from which initial weights are drawn. The default is the uniform adaptive option. Other options are Uniform and Normal distributions. See the section on initialization for more details.

initial_weight_scale: The scale of the distribution function for Uniform or Normal distributions. For Uniform, the values are drawn uniformly from (-initial_weight_scale, initial_weight_scale). For Normal, the values are drawn from a Normal distribution with a standard deviation of initial_weight_scale. Default is 1.0. See the section on initialization for more details.

loss: The default is automatic, based on the particular learning problem. See the section on activation and loss functions for more details.

score_interval: The minimum time (in seconds) to elapse between model scorings. The actual interval is determined by the number of training samples per iteration and the scoring duty cycle. Default is 5.

score_training_samples: The number of training dataset points to be used for scoring. Sampled randomly. Use 0 to select the entire training dataset. Default is 10000.

score_validation_samples: The number of validation dataset points to be used for scoring. Can be randomly sampled or stratified (if balance_classes is set and score_validation_sampling is set to Stratified). Use 0 to select the entire validation dataset (this is also the default).

score_duty_cycle: Maximum fraction of wall-clock time spent on model scoring on training and validation samples, and on diagnostics such as computation of feature importances (i.e., not on training). Default is 0.1.

classification_stop: The stopping criterion in terms of classification error (1 - accuracy) on the training data scoring dataset. When the error is at or below this threshold, training stops. Default is 0.

regression_stop: The stopping criterion in terms of regression error (MSE) on the training data scoring dataset. When the error is at or below this threshold, training stops. Default is 1e-6.

quiet_mode: Enable quiet mode for less output to standard output. Default is false.

max_confusion_matrix_size: For classification models, the maximum size (in terms of classes) of the confusion matrix for it to be printed. This option avoids printing extremely large confusion matrices. Default is 20.

max_hit_ratio_k: The maximum number (top K) of predictions to use for hit ratio computation (for multi-class only; 0 to disable). Default is 10.
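The classification_stop and regression_stop criteria amount to a simple threshold check at each scoring event. An illustrative sketch (the error sequence below is made up, not H2O output):

```python
def should_stop(error, threshold):
    # Training halts once the scored training error is at or below the threshold.
    return error <= threshold

# Hypothetical per-scoring-event training MSE with regression_stop = 1e-6:
errors = [0.1, 1e-3, 1e-5, 9e-7, 5e-7]
stop_epoch = next(i for i, e in enumerate(errors) if should_stop(e, 1e-6))
print(stop_epoch)  # 3: training stops at the fourth scoring event
```

Note that classification_stop defaults to 0, so by default training only stops early on reaching perfect training accuracy; set it to a small positive value to stop sooner.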
balance_classes: For imbalanced data, balance training data class counts via over/under-sampling. This can result in improved predictive accuracy. Default is false.

class_sampling_factors: Desired over/under-sampling ratios per class (in lexicographic order). Only applies when balance_classes is enabled. If not specified, the ratios are automatically computed to obtain class balance during training.

max_after_balance_size: When classes are balanced, limit the resulting dataset size to the specified multiple of the original dataset size. This is the maximum relative size of the training data after balancing class counts (can be less than 1.0). Default is 5.0.
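When class_sampling_factors is not given, balanced sampling ratios can be derived directly from the class counts. A hypothetical sketch of such a computation (not H2O's actual code — every class is simply scaled toward the size of the largest class):

```python
from collections import Counter

def sampling_factors(labels):
    counts = Counter(labels)
    largest = max(counts.values())
    # one factor per class, in lexicographic order as for class_sampling_factors
    return {c: largest / n for c, n in sorted(counts.items())}

labels = ["a"] * 800 + ["b"] * 150 + ["c"] * 50
print(sampling_factors(labels))  # 'a' stays at 1.0, 'c' is over-sampled 16x
```

With these factors the rare class "c" would be over-sampled 16-fold, which is why max_after_balance_size exists: it caps how much the balanced dataset may grow relative to the original.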
score_validation_sampling: Method used to sample the validation dataset for scoring. The possible methods are Uniform and Stratified. Default is Uniform.

diagnostics: Gather diagnostics for hidden layers, such as mean and RMS values of learning rate, momentum, weights and biases. Default is true.

variable_importances: Whether to compute variable importances for input features. The implementation considers the weights connecting the input features to the first two hidden layers. Default is false.

fast_mode: Enable fast mode (a minor approximation in back-propagation) that should not affect results significantly. Default is true.

ignore_const_cols: Ignore constant training columns (no information can be gained from them anyway). Default is true.

force_load_balance: Increase training speed on small datasets by splitting them into many chunks to allow utilization of all cores. Default is true.

replicate_training_data: Replicate the entire training dataset onto every node for faster training on small datasets. Default is true.

single_node_mode: Run on a single node for fine-tuning of model parameters. Can be useful for faster convergence during checkpoint resumes after training on a very large number of nodes (for fast initial convergence). Default is false.

shuffle_training_data: Enable shuffling of training data (on each node). This option is recommended if the training data is replicated on N nodes and the number of training samples per iteration is close to N times the dataset size, where all nodes train with (almost) all of the data. It is automatically enabled if the number of training samples per iteration is set to -1 (or to N times the dataset size or larger); otherwise it is disabled by default.

max_categorical_features: Maximum number of categorical features, enforced via hashing (experimental).

reproducible: Force reproducibility on small data (will be slow; only uses one thread).

sparse: Enable sparse data handling (experimental).
col_major: Use a column-major weight matrix for the input layer; can speed up forward propagation, but may slow down backpropagation.

input_dropout_ratios: Enable input layer dropout ratios, which can improve generalization, by specifying one value per hidden layer.

7 Appendix B: References

H2O website
H2O documentation
H2O GitHub repository
H2O training
H2O training scripts and data
Code for this document: DeepLearningRVignetteDemo
H2O support: [email protected]
H2O JIRA
YouTube videos
Learning Deep Architectures for AI. Bengio, Yoshua.
Efficient BackProp. LeCun et al.
Maxout Networks. Goodfellow et al.
HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. Niu et al.
Improving neural networks by preventing co-adaptation of feature detectors. Hinton et al.
On the importance of initialization and momentum in deep learning. Sutskever et al.
ADADELTA: An Adaptive Learning Rate Method. Zeiler.
H2O GitHub repository for the H2O Deep Learning documentation
MNIST database
Reducing the Dimensionality of Data with Neural Networks. Hinton et al.
Biographies

Viraj Parmar
Viraj is currently an undergraduate at Princeton studying applied mathematics. Prior to joining H2O.ai as a data and math hacker intern, Viraj worked in a research group at the MIT Center for Technology and Design. His interests are in software engineering and large-scale machine learning. Apart from work, Viraj enjoys reading, sampling new cuisines, and traveling with his family.

Arno Candel
Arno is a Physicist & Hacker at H2O.ai. Prior to that, he was a founding Senior MTS at Skytree, where he designed and implemented high-performance machine learning algorithms. He has over a decade of experience in HPC with C++/MPI and had access to the world's largest supercomputers as a Staff Scientist at SLAC National Accelerator Laboratory, where he participated in US DOE scientific computing initiatives. While at SLAC, he authored the first curvilinear finite-element simulation code for space-charge dominated relativistic free electrons and scaled it to thousands of compute nodes. He also led a collaboration with CERN to model the electromagnetic performance of CLIC, a ginormous e+e- collider and potential successor of the LHC. Arno has authored dozens of scientific papers and is a sought-after conference speaker. He holds a PhD and a Masters summa cum laude in Physics from ETH Zurich. Arno was named a "2014 Big Data All-Star" by Fortune Magazine.
"Smoooooth! - if I have to explain it in one word. H2O made this really easy for R users." - Jo-Fai Chow
Using Adaptive Random Trees (ART) for optimal scorecard segmentation
A FAIR ISAAC WHITE PAPER Using Adaptive Random Trees (ART) for optimal scorecard segmentation By Chris Ralph Analytic Science Director April 2006 Summary Segmented systems of models are widely recognized
Regularized Logistic Regression for Mind Reading with Parallel Validation
Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland
Implementation of Neural Networks with Theano. http://deeplearning.net/tutorial/
Implementation of Neural Networks with Theano http://deeplearning.net/tutorial/ Feed Forward Neural Network (MLP) Hidden Layer Object Hidden Layer Object Hidden Layer Object Logistic Regression Object
Predicting borrowers chance of defaulting on credit loans
Predicting borrowers chance of defaulting on credit loans Junjie Liang ([email protected]) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: ([email protected]) TAs: Pierre-Luc Bacon ([email protected]) Ryan Lowe ([email protected])
Novelty Detection in image recognition using IRF Neural Networks properties
Novelty Detection in image recognition using IRF Neural Networks properties Philippe Smagghe, Jean-Luc Buessler, Jean-Philippe Urban Université de Haute-Alsace MIPS 4, rue des Frères Lumière, 68093 Mulhouse,
Machine Learning Capacity and Performance Analysis and R
Machine Learning and R May 3, 11 30 25 15 10 5 25 15 10 5 30 25 15 10 5 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 100 80 60 40 100 80 60 40 100 80 60 40 30 25 15 10 5 25 15 10
Machine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories
Performance Tuning Guidelines for PowerExchange for Microsoft Dynamics CRM
Performance Tuning Guidelines for PowerExchange for Microsoft Dynamics CRM 1993-2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying,
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
Follow links Class Use and other Permissions. For more information, send email to: [email protected]
COPYRIGHT NOTICE: David A. Kendrick, P. Ruben Mercado, and Hans M. Amman: Computational Economics is published by Princeton University Press and copyrighted, 2006, by Princeton University Press. All rights
Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks
This version: December 12, 2013 Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks Lawrence Takeuchi * Yu-Ying (Albert) Lee [email protected] [email protected] Abstract We
Structural Health Monitoring Tools (SHMTools)
Structural Health Monitoring Tools (SHMTools) Getting Started LANL/UCSD Engineering Institute LA-CC-14-046 c Copyright 2014, Los Alamos National Security, LLC All rights reserved. May 30, 2014 Contents
Conclusions and Future Directions
Chapter 9 This chapter summarizes the thesis with discussion of (a) the findings and the contributions to the state-of-the-art in the disciplines covered by this work, and (b) future work, those directions
Lecture 6: Classification & Localization. boris. [email protected]
Lecture 6: Classification & Localization boris. [email protected] 1 Agenda ILSVRC 2014 Overfeat: integrated classification, localization, and detection Classification with Localization Detection. 2 ILSVRC-2014
Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
White Paper. Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics
White Paper Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics Contents Self-service data discovery and interactive predictive analytics... 1 What does
Analecta Vol. 8, No. 2 ISSN 2064-7964
EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,
How I won the Chess Ratings: Elo vs the rest of the world Competition
How I won the Chess Ratings: Elo vs the rest of the world Competition Yannis Sismanis November 2010 Abstract This article discusses in detail the rating system that won the kaggle competition Chess Ratings:
Neural Networks and Back Propagation Algorithm
Neural Networks and Back Propagation Algorithm Mirza Cilimkovic Institute of Technology Blanchardstown Blanchardstown Road North Dublin 15 Ireland [email protected] Abstract Neural Networks (NN) are important
DYNAMIC LOAD BALANCING OF FINE-GRAIN SERVICES USING PREDICTION BASED ON SERVICE INPUT JAN MIKSATKO. B.S., Charles University, 2003 A THESIS
DYNAMIC LOAD BALANCING OF FINE-GRAIN SERVICES USING PREDICTION BASED ON SERVICE INPUT by JAN MIKSATKO B.S., Charles University, 2003 A THESIS Submitted in partial fulfillment of the requirements for the
Heritage Provider Network Health Prize Round 3 Milestone: Team crescendo s Solution
Heritage Provider Network Health Prize Round 3 Milestone: Team crescendo s Solution Rie Johnson Tong Zhang 1 Introduction This document describes our entry nominated for the second prize of the Heritage
Data quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
Spark. Fast, Interactive, Language- Integrated Cluster Computing
Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC
Oracle Data Miner (Extension of SQL Developer 4.0)
An Oracle White Paper September 2013 Oracle Data Miner (Extension of SQL Developer 4.0) Integrate Oracle R Enterprise Mining Algorithms into a workflow using the SQL Query node Denny Wong Oracle Data Mining
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
MAGENTO HOSTING Progressive Server Performance Improvements
MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 [email protected] 1.866.963.0424 www.simplehelix.com 2 Table of Contents
Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup
Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor
Application of Neural Network in User Authentication for Smart Home System
Application of Neural Network in User Authentication for Smart Home System A. Joseph, D.B.L. Bong, D.A.A. Mat Abstract Security has been an important issue and concern in the smart home systems. Smart
How To Understand How Weka Works
More Data Mining with Weka Class 1 Lesson 1 Introduction Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz More Data Mining with Weka a practical course
Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support
Stock Prediction using Artificial Neural Networks
Stock Prediction using Artificial Neural Networks Abhishek Kar (Y8021), Dept. of Computer Science and Engineering, IIT Kanpur Abstract In this work we present an Artificial Neural Network approach to predict
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com Spark Fast & Expressive Cluster computing engine Compatible with Hadoop Came
Maschinelles Lernen mit MATLAB
Maschinelles Lernen mit MATLAB Jérémy Huard Applikationsingenieur The MathWorks GmbH 2015 The MathWorks, Inc. 1 Machine Learning is Everywhere Image Recognition Speech Recognition Stock Prediction Medical
