SparkLab May 2015
An Introduction to Scala & Spark

Apostolos N. Papadopoulos
Assistant Professor
Data Engineering Lab, Department of Informatics, Aristotle University of Thessaloniki

Abstract

Welcome to SparkLab! In this seminar, we are going to learn the very basics of cluster computing using the Scala programming language and the exciting Spark engine. Spark is steadily becoming the state of the art in cluster computing and in big data processing and analytics, thanks to the excellent support it provides for several domains, such as SQL processing, streaming, machine learning and graphs. In addition, Spark supports three programming languages: Java, Scala and Python. In this seminar, we are going to use Scala.

To get the best out of the seminar, you are advised to bring your laptop to class. Also, to participate in the hands-on part, you will need either to install the required software locally or to log on to a remote server using ssh. However, in order to re-execute the examples and to write and test your own code at home, it is better to use the first option, and to use the second option only as a fallback.

1. Installation for Local Access

We assume that you are working with Linux, and more specifically with Ubuntu. The examples have been tested on Ubuntu, CentOS and Mint; most probably, everything will run smoothly on any other Linux distribution. The same applies to MacOS. Windows users can consult the Apache Spark website for installation instructions.

For the examples to work, we are going to need the following software:

- Java JDK (7 or 8 is recommended)
- Scala version 2.10.4
- sbt version 0.13.8
- Spark version 1.3.1

Java Installation (if you already have Java version 1.6 or later, skip this section)

To install Java (if it is not already on your system), you can either follow any guide you find on the web, or use the following steps for Java 8:

    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install oracle-java8-installer

In case you are not working with apt, use the most convenient way for you to install Java on your system.
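Before proceeding, it is a good idea to verify the Java installation from a terminal. A quick check (the version string in the comment is only illustrative):

    java -version   # should report 1.7 or later, e.g. java version "1.8.0_45"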

sbt, Scala and Spark Installation

In fact, these packages do not need installation: all you need is to download a tarfile and extract it. For uniformity, it is better to create a directory named sparklab in your home directory and put everything there. For the seminar, I am going to assume that these packages are extracted in /usr/local.

Download sbt from the following URL:

    http://www.scala-sbt.org/download.html

Extract the tarfile in /usr/local, and you should now have sbt stored under /usr/local/sbt-0.13.8.

Download Scala version 2.10.4 using the following URL:

    http://www.scala-lang.org/download/2.10.4.html

Again, extract the tarfile inside /usr/local to get a directory /usr/local/scala-2.10.4.

Finally, download Spark version 1.3.1 using the following URL:

    https://spark.apache.org/downloads.html

Be sure to download the package that refers to "Pre-built for Hadoop 2.6 and later"; the file should be spark-1.3.1-bin-hadoop2.6.tgz. Again, extract the tarfile inside /usr/local to get /usr/local/spark-1.3.1-bin-hadoop2.6.

Note that in order to extract tarfiles inside /usr/local, we should either have root permissions or use sudo.

The next step is to put some exports in the .bashrc file to update the PATH environment variable. Based on the previous discussion, you should copy-paste the following lines into your .bashrc file, which is located in your home directory:

    export PATH=$PATH:/usr/local/scala-2.10.4/bin
    export PATH=$PATH:/usr/local/sbt-0.13.8/bin
    export PATH=$PATH:/usr/local/spark-1.3.1-bin-hadoop2.6/bin
    export PATH=$PATH:/usr/local/spark-1.3.1-bin-hadoop2.6/sbin

Then run the command source ~/.bashrc in order to activate the updates performed in .bashrc.

(Note: You can also extract the required software anywhere in your filesystem. Just remember to set the PATH variable accordingly.)

Installation of the Examples

The examples tarfile can be downloaded from:

    http://delab.csd.auth.gr/~apostol/sparklab.html

Create a folder with the name sparklab and extract the examples tarfile inside the sparklab folder. Now you are ready to go!
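You can verify that the PATH updates took effect by asking each tool for its version in a fresh terminal. A quick sanity check (the outputs in the comments are illustrative; run sbt inside one of the example folders so it does not create stray project files):

    scala -version          # Scala code runner version 2.10.4 ...
    sbt about               # ... sbt 0.13.8 ...
    spark-submit --version  # Welcome to ... version 1.3.1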

2. Remote Access

In case you have a problem with the previous process, you can still run the demos by using a remote installation of Spark. To be able to run the demos presented in class, you will need to log in to a remote server using ssh (secure shell). Please follow these steps:

1. Log in to the remote server that will be given to you during the seminar. Assuming that this is server.csd.auth.gr, use the following command:

    ssh username@server.csd.auth.gr

The server name, the username and the password will be made known to you during the seminar. (Note: Windows users can use the PuTTY ssh client, which can be downloaded from http://www.putty.org/)

Attention: please do not delete or change any file, since this account will be shared by all participants!

2. After login, go to the sparklab folder by executing:

    cd sparklab

3. Check the contents of the folder by executing:

    ls -las

You should see the folder examples.

4. Enter the examples folder by executing:

    cd examples

There you should see two folders, scala and spark.

3. Running the Scala REPL

If you want to test some code in the Scala programming language, the easiest way is to use the Scala REPL (Read-Evaluate-Print Loop). This is an interactive tool which accepts Scala code and provides the result in an interactive fashion. To start the REPL, issue the command:

    scala

You will see the scala prompt waiting for commands. We are going to demonstrate the use of the REPL in class. You can exit the REPL by giving the command :q. A short session looks like the example below.
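For illustration, here is a minimal REPL session, assuming a standard Scala 2.10.4 installation (the res0 name and the type annotation after each result are printed by the REPL itself):

    scala> val xs = List(1, 2, 3)
    xs: List[Int] = List(1, 2, 3)

    scala> xs.map(x => x * 2)
    res0: List[Int] = List(2, 4, 6)

    scala> :q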

4. Running Scala Examples

Compilation and code running will be performed using the Simple Build Tool (sbt). First, we will work with plain Scala, so type cd scala and inspect the contents of the folder by executing ls -las. You should see 12 folders: 01-HelloWorld, 02-LineCount, etc.

Assume that you want to run the hello world example. Enter the folder 01-HelloWorld by executing:

    cd 01-HelloWorld

(Note: use command completion with the TAB key for faster typing.)

Let's see the source code. Type the following command:

    cat ./src/main/scala/HelloWorld.scala

(Note: if you are familiar with the vi editor, then of course you can use it instead.) The result should be:

    object HelloWorld {
      def main(args: Array[String]): Unit = {
        println("Hello, world!")
      }
    }

To compile the code, type:

    sbt compile

Run the code by typing:

    sbt run

Use the same procedure to run the rest of the Scala examples.

5. Running Spark Examples

Now we are going to test some Spark examples. The code is again written in Scala, but this time we are going to use the libraries provided by Spark in order to facilitate cluster computations. Assuming that you are in the examples folder, go into the spark folder by typing:

    cd spark

You should see 7 folders, named 01-HelloSpark through 07-TriangleCounting. Enter the hello spark folder by executing:

    cd 01-HelloSpark

First, check the source code by typing:

    cat ./src/main/scala/HelloSpark.scala

You should see the following Scala code:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf

    object HelloSpark {
      def main(args: Array[String]): Unit = {
        println("*************************************************")
        println("Hello, Spark!")
        println("*************************************************")
      }
    }
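Besides the source file, each project contains an sbt build definition. The exact file bundled with the examples may differ, but a minimal build.sbt consistent with the versions used in this seminar would look roughly like this:

    // Illustrative build.sbt; the one shipped with the examples may differ.
    // spark-core is needed at compile time; at run time, spark-submit
    // supplies the Spark classes.
    name := "HelloSpark"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"

The name, version and scalaVersion settings determine the name of the jar produced by sbt package: here it will be hellospark_2.10-1.0.jar, which is exactly the file passed to spark-submit below.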

To build the project, execute the command:

    sbt package

Execute the code by calling the spark-submit command as follows:

    spark-submit --class "HelloSpark" ./target/scala-2.10/hellospark_2.10-1.0.jar

Check the output on your screen. Use the same procedure to run the rest of the Spark examples.

(Note: For interactive use of Spark, you may execute the Spark shell by issuing the command spark-shell. This is an interactive tool, just like the Scala REPL, with additional capabilities to use RDDs and write code for Spark. It is a nice option for testing Spark, trying out some code and writing small programs. For larger programs, it is better to write your code in an IDE, package the code and then execute it using spark-submit.)

In case of problems, please contact me: papadopo@csd.auth.gr

Happy Sparking!
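As a final illustration of the spark-shell mentioned above, here is what a short session might look like (startup log lines omitted; the RDD echo line is illustrative, and the sc variable is created automatically by the shell):

    scala> val data = sc.parallelize(1 to 100)
    data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

    scala> data.filter(_ % 2 == 0).count()
    res0: Long = 50

    scala> :q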