EPL451: Data Mining on the Web Lab 7


EPL451: Data Mining on the Web Lab 7
Παύλος Αντωνίου, Office: B109, ΘΕΕ01
University of Cyprus, Department of Computer Science

Mahout clustering in Java
Three steps prior to running Mahout clustering algorithms:
Data preprocessing (optional)

Input data prior to clustering
Marking the initial clusters is an important step in k-means clustering

Input data after clustering
Even with distant initial centers, the k-means algorithm is able to iterate correctly and re-evaluate the centers based on the Euclidean distance measure

Playing around

Task 1
Download the three (3) Java files implementing the simple k-means example presented before; the related Hadoop & Mahout classes must be imported.
Problem: a number of jar files (and their dependencies) must be downloaded and added to the classpath; it is difficult to manually discover and specify all dependency libraries.
Solution: Maven, an Apache tool for building and managing any Java-based project, with an excellent dependency management mechanism and an easy build process.

Task 1 - create Maven project
Installation: sudo apt-get install maven
Create the Maven project:
mvn archetype:generate -DgroupId=com.csdeptucy.app -DartifactId=kmeans-clustering -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
Get into the project folder: cd kmeans-clustering
See the project structure sketched below; the pom.xml file is the core of the project's configuration.
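For reference, the maven-archetype-quickstart archetype with the groupId and artifactId above typically generates a layout along these lines (a sketch, not a verbatim listing):

kmeans-clustering/
    pom.xml
    src/main/java/com/csdeptucy/app/App.java
    src/test/java/com/csdeptucy/app/AppTest.java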

POM file example

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.mycompany.app</groupId>
  <artifactId>my-app</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>Maven Quick Start Archetype</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.8.2</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>
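For this lab the generated pom.xml is later replaced (see Task 1 below) with one that pulls in the Mahout/Hadoop libraries and builds a jar-with-dependencies. A minimal sketch of what such additions could look like, assuming Mahout 0.7 from Maven Central and the standard assembly plugin (the versions and plugin configuration are assumptions, not taken from the lab's pom.xml):

<dependencies>
  <dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.7</version>
  </dependency>
</dependencies>
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

With an execution bound to the package phase as above, mvn clean package also produces the *-jar-with-dependencies.jar used later in Task 1.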

Maven phases
Most common lifecycle phases:
validate: validate the project is correct and all necessary information is available
compile: compile the source code of the project
test: test the compiled source code using a suitable unit testing framework; these tests should not require the code to be packaged or deployed
package: take the compiled code and package it in its distributable format, such as a JAR
integration-test: process and deploy the package if necessary into an environment where integration tests can be run
verify: run any checks to verify the package is valid and meets quality criteria
install: install the package into the local repository, for use as a dependency in other projects locally
deploy: done in an integration or release environment, copies the final package to the remote repository for sharing with other developers and projects
clean: cleans up artifacts created by prior builds
site: generates site documentation for this project
Phases may be executed in sequence, e.g.: mvn clean package

Test initial application
Test the newly compiled and packaged JAR with the following command:
java -cp target/kmeans-clustering-1.0-SNAPSHOT.jar com.csdeptucy.app.App
which will print: Hello World!
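The class being run is the stub generated by the quickstart archetype; it typically looks roughly like this (a sketch of the generated file, not code shipped with the lab):

package com.csdeptucy.app;

// Minimal class generated by maven-archetype-quickstart; it only prints a greeting
public class App {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}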

Task 1 Insights
In order to run a k-means clustering job, the method KMeansDriver.run() from the Mahout API (version 0.7) must be invoked:
public static void run(
    org.apache.hadoop.conf.Configuration conf,
    org.apache.hadoop.fs.Path input,
    org.apache.hadoop.fs.Path clustersIn,
    org.apache.hadoop.fs.Path output,
    DistanceMeasure measure,
    double convergenceDelta,
    int maxIterations,
    boolean runClustering,
    double clusterClassificationThreshold,
    boolean runSequential)

run() method parameters
input - the directory pathname for input points
clustersIn - the directory pathname for initial & computed clusters
output - the directory pathname for output points
measure - the DistanceMeasure to use
convergenceDelta - the convergence delta value
maxIterations - the maximum number of iterations
runClustering - true if points are to be clustered after the iterations are completed
clusterClassificationThreshold - a clustering strictness / outlier removal parameter; its value should be between 0 and 1; vectors having a pdf below this value will not be clustered
runSequential - if true, execute the sequential algorithm
A minimal invocation sketch is shown after this list.
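A minimal sketch of such an invocation, assuming the Mahout 0.7 API above (the paths and numeric values are illustrative placeholders, not the lab's official settings):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class KMeansRunSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical directories; the lab's SimpleKMeansClustering uses its own paths
        Path input = new Path("testdata/points");        // input vectors
        Path clustersIn = new Path("testdata/clusters"); // initial cluster centers
        Path output = new Path("output");                // results

        KMeansDriver.run(conf, input, clustersIn, output,
                new EuclideanDistanceMeasure(), // distance measure
                0.001,  // convergenceDelta
                10,     // maxIterations
                true,   // runClustering: cluster the points after the iterations
                0.0,    // clusterClassificationThreshold (between 0 and 1)
                true);  // runSequential: run the in-memory, non-MapReduce version
    }
}

Passing runSequential = true keeps everything in a single JVM, which would be enough for the small point set of Task 1; setting it to false runs the MapReduce version instead.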

Task 1: Unzip LAB07.zip
Place the 3 java files into the kmeans-clustering/src/main/java/com/csdeptucy/app folder
Replace the pom.xml file
Clean artifacts from the previous build and regenerate a jar file: mvn clean package
In case of a java.lang.OutOfMemoryError: Java heap space error, run in a terminal:
export MAVEN_OPTS=-Xmx1024m
mvn clean package
Run the application (start Hadoop first):
sudo java -cp target/kmeans-clustering-1.0-SNAPSHOT-jar-with-dependencies.jar com.csdeptucy.app.SimpleKMeansClustering

Task 1 - Results
1.0: [1.000, 1.000] belongs to cluster 0
1.0: [2.000, 1.000] belongs to cluster 0
1.0: [1.000, 2.000] belongs to cluster 0
1.0: [2.000, 2.000] belongs to cluster 0
1.0: [3.000, 3.000] belongs to cluster 0
1.0: [8.000, 8.000] belongs to cluster 1
1.0: [9.000, 8.000] belongs to cluster 1
1.0: [8.000, 9.000] belongs to cluster 1
1.0: [9.000, 9.000] belongs to cluster 1

Task 2
Use the SimpleKMeansClustering.java file to create a new Java program that clusters the Reuters articles preprocessed in the previous lab (Lab 6).
The Reuters data are in HDFS; download them locally using: hadoop fs -copyToLocal /user/csdeptucy/mahout-work
The input points (as vectors) are stored in the folder mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors
The clustersIn data (initial cluster centers) are stored in the folder mahout-work/reuters-kmeans-clusters/part-randomSeed
Make use of the CosineDistanceMeasure
For the other parameters use the same values as in Lab 6
A hedged sketch of how these inputs map onto the run() call is shown below.
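A rough sketch of how the Task 2 inputs could be wired into the same run() call, assuming the remaining parameter values are carried over from Lab 6 (the output path and the numeric placeholders below are assumptions, not the lab's official solution):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.CosineDistanceMeasure;

public class ReutersKMeansSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Paths from the task description above; the output folder name is an assumption
        Path input = new Path("mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors");
        Path clustersIn = new Path("mahout-work/reuters-kmeans-clusters/part-randomSeed");
        Path output = new Path("mahout-work/reuters-kmeans-output");

        KMeansDriver.run(conf, input, clustersIn, output,
                new CosineDistanceMeasure(), // cosine distance, as required by Task 2
                0.001,  // convergenceDelta - placeholder, use the Lab 6 value
                10,     // maxIterations    - placeholder, use the Lab 6 value
                true,   // runClustering
                0.0,    // clusterClassificationThreshold
                false); // runSequential: false runs the MapReduce version on Hadoop
    }
}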

Useful tips
Sample: https://github.com/tdunning/mia/tree/mahout-0.7
Javadocs:
Hadoop: https://hadoop.apache.org/docs/r2.6.3/api/
Mahout: http://archive-primary.cloudera.com/cdh4/cdh/4/mahout-0.7-cdh4.3.2/mahout-core/