Package PRISMA. February 15, 2013



Similar documents
Package neuralnet. February 20, 2015

Package retrosheet. April 13, 2015

Package bigdata. R topics documented: February 19, 2015

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Package cpm. July 28, 2015

Open source framework for data-flow visual analytic tools for large databases

Component Ordering in Independent Component Analysis Based on Data Power

Package empiricalfdr.deseq2

Package EstCRM. July 13, 2015

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

The Data Mining Process

Package MBA. February 19, Index 7. Canopy LIDAR data

T : Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari :

Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Big Data and Scripting map/reduce in Hadoop

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Package bigrf. February 19, 2015

Oracle Data Miner (Extension of SQL Developer 4.0)

BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE

Package MixGHD. June 26, 2015

NEW TECHNIQUE TO DEAL WITH DYNAMIC DATA MINING IN THE DATABASE

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Package MDM. February 19, 2015

Web Document Clustering

Package HHG. July 14, 2015

Analysis Tools and Libraries for BigData

White Paper April 2006

Package ATE. R topics documented: February 19, Type Package Title Inference for Average Treatment Effects using Covariate. balancing.

Package trimtrees. February 20, 2015

Package httprequest. R topics documented: February 20, 2015

Package erp.easy. September 26, 2015

Package uptimerobot. October 22, 2015

Package ChannelAttribution

Package searchconsoler

Analytics on Big Data

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Package missforest. February 20, 2015

Similarity Search in a Very Large Scale Using Hadoop and HBase

Two-Sample T-Tests Allowing Unequal Variance (Enter Difference)

New Matrix Approach to Improve Apriori Algorithm

How To Cluster

A Survey on Outlier Detection Techniques for Credit Card Fraud Detection

Distributed Computing and Big Data: Hadoop and MapReduce

Data Mining in Incident Response Challenges and Opportunities

Package hoarder. June 30, 2015

Package fimport. February 19, 2015

Package hier.part. February 20, Index 11. Goodness of Fit Measures for a Regression Hierarchy

Use of Data Mining Techniques to Improve the Effectiveness of Sales and Marketing

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Package dsmodellingclient

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

Big Data Analytics CSCI 4030

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Fast Analytics on Big Data with H20

Keywords : Data Warehouse, Data Warehouse Testing, Lifecycle based Testing

SAP Predictive Analysis Installation

Statistical Machine Learning from Data

Identifying SPAM with Predictive Models

Package lmertest. July 16, 2015

extreme Datamining mit Oracle R Enterprise

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

Introduction to Learning & Decision Trees

Lecture 3. Linear Programming. 3B1B Optimization Michaelmas 2015 A. Zisserman. Extreme solutions. Simplex method. Interior point method

Machine Learning using MapReduce

Package tagcloud. R topics documented: July 3, 2015

CDD user guide. PsN Revised

Two-Sample T-Tests Assuming Equal Variance (Enter Means)

New Work Item for ISO Predictive Analytics (Initial Notes and Thoughts) Introduction

Self Organizing Maps for Visualization of Categories

Data Structures and Performance for Scientific Computing with Hadoop and Dumbo

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Package treemap. February 15, 2013

Mining an Online Auctions Data Warehouse

Package syuzhet. February 22, 2015

Package Rdsm. February 19, 2015

Unsupervised Data Mining (Clustering)

Bitmap Index an Efficient Approach to Improve Performance of Data Warehouse Queries

Impact Analysis for Software changes in OpenLAB CDS A.01.04

Intrusion Detection System using Log Files and Reinforcement Learning

Package plan. R topics documented: February 20, 2015

ADVANCED MACHINE LEARNING. Introduction

Package CoImp. February 19, 2015

Adaptive Framework for Network Traffic Classification using Dimensionality Reduction and Clustering

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

npsolver A SAT Based Solver for Optimization Problems

Package changepoint. R topics documented: November 9, Type Package Title Methods for Changepoint Detection Version 2.

Package TSfame. February 15, 2013

Bachelor of Games and Virtual Worlds (Programming) Subject and Course Summaries

Transcription:

Package PRISMA February 15, 2013 Type Package Title Protocol Inspection and State Machine Analysis Version 0.1-0 Date 2012-10-02 Depends R (>= 2.10), Matrix, gplots, ggplot2 Author Tammo Krueger, Nicole Kraemer Maintainer The PRISMA package is capable of loading and processing huge text corpora processed with the sally toolbox (http://www.mlsec.org/sally/). sally acts as a ver fast preprocessor which splits the text files into tokens or n-grams. These output files can then be read with the PRISMA package which applies testing-based token selection and has some duplicat-aware, highly tuned Non-Negative Matrix Factorzation and Principal component Analysis implementation which allows the processing of very big data sets even on desktop machines. License GPL (>= 2.0) Repository CRAN Date/Publication 2012-10-08 10:38:21 NeedsCompilation no R topics documented: PRISMA-package...................................... 2 asap............................................. 3 estimatedimension..................................... 3 getduplicatedata...................................... 4 loadprismadata....................................... 5 1

2 PRISMA-package plot.prisma......................................... 6 plot.prismadimension.................................... 6 plot.prismamf....................................... 7 prismaduplicatepca.................................... 8 prismahclust........................................ 9 prismanmf......................................... 10 Index 12 PRISMA-package PRISMA - Protocol Inspection and State Machine Analysis Details The PRISMA package is capable of loading and processing huge text corpora processed with the sally toolbox (http://www.mlsec.org/sally/). sally acts as a ver fast preprocessor which splits the text files into tokens or n-grams. These output files can then be read with the PRISMA package which applies testing-based token selection and has some duplicat-aware, highly tuned Non- Negative Matrix Factorzation and Principal component Analysis implementation which allows the processing of very big data sets even on desktop machines. Package: PRISMA Type: Package Version: 0.1 Date: 2012-10-02 License: GPL (>=2.0) Tammo Krueger, Nicole Kraemer Maintainer: References Krueger, T., Gascon, H., Kraemer, N., Rieck, K. (2012) Learning Stateful Models for Network Honeypots 5th ACM Workshop on Artificial Intelligence and Security (AISEC 2012), accepted Krueger, T., Kraemer, N., Rieck, K. (2011) ASAP: Automatic Semantics-Aware Analysis of Network Payloads Privacy and Security Issues in Data Mining and Machine Learning - International ECML/PKDD Workshop. Lecture Notes in Computer Science 6549, Springer. 50-63

asap 3 plot(asap) asapdim = estimatedimension(asap) plot(asapdim) asapnmf = prismanmf(asap, asapdim, time=120) plot(asapnmf, minvalue=.2) asap Lengths of Major North American Rivers Toy data set to show the capabilities of the PRISMA package. asap Format A prisma object. References Krueger, T., Kraemer, N., Rieck, K. (2011) ASAP: Automatic Semantics-Aware Analysis of Network Payloads Privacy and Security Issues in Data Mining and Machine Learning - International ECML/PKDD Workshop. Lecture Notes in Computer Science 6549, Springer. 50-63 estimatedimension Estimate Inner Dimension Matrix factorization methods compress the original data matrix A R f,n with f features and N samples into two parts, namely A = BC with B R f,k, C R k,n. The function estimatedimension estimates k based on a noise model estimated from a scrambled version of the original data matrix. estimatedimension(prismadata, alpha = 0.05, nscramblesamples = NULL)

4 getduplicatedata Value prismadata A prismadata object loaded via loadprismadata alpha Error probability for confidence intervals nscramblesamples The number of scrambled samples that should be used to estimate the noise model. NULL means to use the complete data set. estdim prismadimension object that can be printed and plotted. References R.~Schmidt. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 34(3):276 280, 1986. asapdim = estimatedimension(asap) plot(asapdim) getduplicatedata Restores Data with Duplicates The loadprismadata function triggers a feature selection and combination methods which subsequently filters out duplicate values to save memory space. The getduplicatedata returns the data matrix with all duplicate entries. getduplicatedata(prismadata) prismadata prisma data loaded via loadprismadata Value datawithduplicates Data matrix containing explicit copies of all duplicates.

loadprismadata 5 datawithduplicates = getduplicatedata(asap) loadprismadata Load PRISMA Data Files Loads files generated by the sally tool (see http://www.mlsec.org/sally/) and represents the data as binary token/ngrams x documents matrix. After loading statistical tests are applied to find features, which are not volatile nor constant. Co-occurring features are grouped to further compactify the data. See system.file("extdata","sallypreprocessing.py", package="prisma") for a Python script which generates the corresponding.fsally file from a.sally file which increases the loading time via loadprismadata. loadprismadata(path, maxlines = -1, fastsally = TRUE, alpha = 0.05) path maxlines fastsally alpha path of the data file without the.sally extension. loadprisma loads path.sally or path.fsally depending on the fastsally switch. maximal number of lines to read from the data file. -1 means to read all lines. should the fsally file be used, which drastically increases loading time. significance level for the feature tests. Value prismadata data object representing the tokenized documents as features x samples matrix. References See http://www.mlsec.org/sally/ for the sally utility. # please extract system.file("extdata","asap.tar.gz", package="prisma") # before running this example asap = loadprismadata(system.file("extdata","asap", package="prisma"))

6 plot.prismadimension plot.prisma Generics For PRISMA Objects Print and plot generic for the PRISMA objects. ## S3 method for class prisma print(x,...) ## S3 method for class prisma plot(x,...) x... Not used See Also prisma data loaded via loadprismadata estimatedimension, prismahclust, prismaduplicatepca, prismanmf print(asap) plot(asap) plot.prismadimension Generics For PRISMA Objects Print and plot generic for the PRISMA dimension objects. ## S3 method for class prismadimension print(x,...) ## S3 method for class prismadimension plot(x,...)

plot.prismamf 7 x... Not used See Also prisma dimension object generated via estimatedimension estimatedimension, prismahclust, prismaduplicatepca, prismanmf asapdim = estimatedimension(asap) print(asapdim) plot(asapdim) plot.prismamf Generics For PRISMA Objects Print and plot generic for the PRISMA matrix factorization objects. ## S3 method for class prismamf plot(x, nlines = NULL, baseindex = NULL, sampleindex = NULL, minvalue = NULL, norowclustering = FALSE, nocolclustering = FALSE, type = c("base", "coordinates"),...) x nlines baseindex sampleindex Prisma matrix factorization object Number of lines that should be plotted Which bases should be plotted Which samples should be plotted minvalue Cut-off value, i.e., every value smaller than minvalue won t be shown norowclustering Don t cluster the rows nocolclustering Don t cluster the columns type Show the base (type = "base", i.e. the B matrix) or show the coordinate (type = "coordinates", i.e. the C matrix).... Not used

8 prismaduplicatepca See Also estimatedimension, prismahclust, prismaduplicatepca, prismanmf asapdim = estimatedimension(asap) asapnmf = prismanmf(asap, asapdim, time=120) plot(asapnmf, minvalue=.2) prismaduplicatepca Matrix Factorization Based on Duplicate-Aware PCA Efficient implementation of a duplicate-aware principal component anaylsis (PCA). prismaduplicatepca(prismadata) prismadata prisma data for which a PCA should be calculated Value prismapca Matrix factorization object $A = B C$, in which the factors are calculate by a duplicate-aware PCA asappca = prismaduplicatepca(asap)

prismahclust 9 prismahclust Matrix Factorization Based on Hierarchical Clustering A matrix factorization $A = B C$ based on the results of hclust is constructed, which holds the mean feature values for each cluster in the matrix $B$ and the indication of the cluster in the matrix $C$ for each data point (i.e. each data point is represented by its assigned cluster center). prismahclust(prismadata, ncomp, method = "single") prismadata ncomp method prisma data for which a clustering should be calculated the number of components that should be extracted. the method used for clustering. Value prismahclust Matrix factorization object containing $B$ and $C$ resulting from the hierarchical clustering of the data. See Also hclust asaphclust = prismahclust(asap)

10 prismanmf prismanmf Matrix Factorization Based on Duplicate-Aware NMF Matrix factorization A = BC with strictly positiv matrices B, C which minimize the reconstruction error A BC. This duplicate-aware version of the non-negtive matrix factorization (NMF) is based on the Alternating Least Square approach and exploits the duplicate information to speed up the calculation. prismanmf(prismadata, ncomp, time = 60, pca.init = TRUE, donorm = TRUE, oldresult = NULL) prismadata ncomp time pca.init donorm oldresult prisma data for which a NMF should be calculated either an integer or prismadimension object specifying the inner dimension of the matrix factorization. seconds after which the calculation should end should the B matrix be initialized by a PCA should the B matrix normalized (i.e. all columns have the Euclidean length of 1) re-use results of a previous run, i.e. B and C are pre-initialized with the values of this previous matrix factorization object. Value prismanmf Matrix factorization object containing tge $B$ and $C$ matrix. References Krueger, T., Gascon, H., Kraemer, N., Rieck, K. (2012) Learning Stateful Models for Network Honeypots 5th ACM Workshop on Artificial Intelligence and Security (AISEC 2012), accepted R. Albright, J. Cox, D. Duling, A. Langville, and C. Meyer. (2006) Algorithms, initializations, and convergence for the nonnegative matrix factorization. Technical Report 81706, North Carolina State University

prismanmf 11 print(asap) plot(asap) asapdim = estimatedimension(asap) print(asapdim) plot(asapdim) # let it run for 60 second asapnmf = prismanmf(asap, asapdim, time=60) plot(asapnmf, minvalue=0.2) #...and for another 60 seconds re-using the results from the previous run asapnmf = prismanmf(asap, asapdim, time=60, oldresult=asapnmf) plot(asapnmf, minvalue=0.2)

Index Topic datasets asap, 3 Topic package PRISMA-package, 2 asap, 3 estimatedimension, 3, 6 8 getduplicatedata, 4 hclust, 9 loadprismadata, 5, 6 plot.prisma, 6 plot.prismadimension, 6 plot.prismamf, 7 print.prisma (plot.prisma), 6 print.prismadimension (plot.prismadimension), 6 PRISMA (PRISMA-package), 2 PRISMA-package, 2 prismaduplicatepca, 6 8, 8 prismahclust, 6 8, 9 prismanmf, 6 8, 10 12