Shark Installation Guide
Week 3 Report
Ankush Arora
Last Updated: May 31, 2014
Contents

1 Introduction
  1.1 Shark
  1.2 Apache Spark
  1.3 Scala
  1.4 SBT (Simple Build Tool):
  1.5 Installation Section is organized as follows:
2 Installation
  2.1 Details Of Installation
  2.2 Dependencies
  2.3 Installation Steps
    2.3.1 Getting Java and Scala:
    2.3.2 Getting Sbt:
    2.3.3 Getting the Patched Hive:
    2.3.4 Spark and Shark:
    2.3.5 Shark:

Ankush Arora June 8, 2014
1 Introduction

This report introduces Shark, Spark, and Scala, describes how each is used, and summarizes the procedure to install and set them up one by one.

1.1 Shark

What is Shark? Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users.

Speed Of Shark: Run Hive queries up to 100x faster in memory, or 10x faster on disk. It uses the powerful Apache Spark engine to speed up computations.

Hive Compatibility of Shark: Run unmodified Hive queries on existing warehouses. It reuses the Hive frontend and metastore, giving you full compatibility with existing Hive data, queries, and UDFs. Simply install it alongside Hive.

Spark Integration: Unlock your data with machine learning and statistics. By running on Spark, Shark can call complex analytics functions like machine learning right from SQL. Or call Shark inside your Spark jobs to load Hive data.

Scalability: Use the same engine for both short and long queries. Unlike other interactive SQL engines, Shark supports mid-query fault tolerance, letting it scale to large jobs too.

1.2 Apache Spark

What is Apache Spark? Apache Spark is a fast and general engine for large-scale data processing.

Speed Of Spark: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Ease of Use: Write applications quickly in Java, Scala or Python. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala and Python shells.

Generality: Combine SQL, streaming, and complex analytics. Spark powers a stack of high-level tools including Shark for SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these frameworks seamlessly in the same application.

Integrated with Hadoop: Spark can run on Hadoop 2's YARN cluster manager, and can read any existing Hadoop data. If you have a Hadoop 2 cluster, you can run Spark without any installation needed.

1.3 Scala

Scalable language: Scala is an acronym for Scalable Language. This means that Scala grows with you. You can play with it by typing one-line expressions and observing the results. The language's scalability is the result of a careful integration of object-oriented and functional language concepts.

Object-Oriented: Scala is a pure-bred object-oriented language. Conceptually, every value is an object and every operation is a method call.

Functional: Even though its syntax is fairly conventional, Scala is also a full-blown functional language. It has everything you would expect, including first-class functions and a library with efficient immutable data structures.

Seamless Java Interop: Scala runs on the JVM. Java and Scala classes can be freely mixed, no matter whether they reside in different projects or in the same one. They can even mutually refer to each other; the Scala compiler contains a subset of a Java compiler.
Functions are Objects: Features from both sides are unified to a degree where functional and object-oriented can be seen as two sides of the same coin. For example, functions in Scala are objects, and a function's type is just a regular class.

Future-Proof: Scala shines when it comes to scalable server software that makes use of concurrent and synchronous processing, parallel utilization of multiple cores, and distributed processing in the cloud.

Fun to learn: It's fun to learn Scala.

1.4 SBT (Simple Build Tool):

SBT is a build tool that uses a Scala-based DSL. Additionally, SBT has some interesting features that come in handy during development, such as starting a Scala REPL with project classes and dependencies on the classpath, continuous compilation and testing with triggered execution, and much more.

1.5 Installation Section is organized as follows:

1. Details of Installation.
2. Dependencies.
3. Installation Steps.

2 Installation

2.1 Details Of Installation

Shark-Version: Shark-0.7
Spark-Version: Spark-0.7
Scala-Version: Scala-2.9.3
Sbt-Version: Sbt-0.12.3
OS used: Ubuntu 12.04

2.2 Dependencies

Hive
Maven
Hadoop
Java
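Before starting the installation steps, it can help to confirm which of the tools named above are already visible on the PATH. The following is a minimal sketch, not part of the original procedure. The function name check_tools is hypothetical, and the launcher names a reader would pass to it (for example java, scala, sbt, mvn, hadoop, hive) are assumptions that may differ between installations. It only checks that a command exists; it does not verify version numbers.

```shell
#!/bin/sh
# Report which of the given command names are available on PATH.
# check_tools is a hypothetical helper; the tool names passed to it
# are assumptions and should be adjusted to the local installation.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # The exit status is the number of tools that were not found.
    return "$missing"
}
```

Calling `check_tools java scala sbt mvn hadoop hive` prints one line per tool and returns a non-zero status if any of them is missing, so it can be used as a guard at the top of a setup script.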
2.3 Installation Steps

This guide describes the components needed to compile and run Shark from the beginning. The installation steps are as follows:

2.3.1 Getting Java and Scala:

The only prerequisite for this guide is that you have Java version 6 or 7 and Scala 2.9.3 installed on your machine. If you don't have Scala 2.9.3, you can download it by running:

2.3.2 Getting Sbt:

To build Spark and Shark, sbt 0.12.3 needs to be installed on your machine. If you don't have sbt 0.12.3, you can install it as follows:

2.3.3 Getting the Patched Hive:

Then download the patched version of Hive and untar it:

2.3.4 Spark and Shark:

Clone the branch-0.7 branch of Spark from GitHub, then compile and publish Spark to your local repository:

Clone the branch-0.7 branch of Shark from GitHub:
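The clone-and-build sequence for Spark and Shark can be sketched as a dry run. This is only an illustration under stated assumptions: the repository URLs are not given in this report, so they are passed in as arguments rather than filled in; the helper names run_in and build_spark_and_shark are hypothetical; and the publish command `sbt/sbt publish-local` is an assumption about the build script shipped in the Spark branch. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Dry-run sketch of the Spark and Shark clone-and-build sequence.
# Assumptions: repository URLs are supplied by the caller, and
# "sbt/sbt publish-local" is the publish command in the Spark checkout.

BRANCH="branch-0.7"

# Print a command together with its working directory; execute it
# only when DRY_RUN is explicitly set to 0.
run_in() {
    dir=$1
    shift
    echo "+ (cd $dir && $*)"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        (cd "$dir" && "$@")
    fi
}

# $1: Spark repository URL, $2: Shark repository URL (both placeholders).
build_spark_and_shark() {
    spark_repo=$1
    shark_repo=$2
    run_in . git clone -b "$BRANCH" "$spark_repo" spark || return 1
    run_in spark sbt/sbt publish-local || return 1
    run_in . git clone -b "$BRANCH" "$shark_repo" shark || return 1
}
```

Running `build_spark_and_shark <spark-url> <shark-url>` lists the three commands in order. Setting `DRY_RUN=0` executes them, and the function stops at the first command that fails.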
2.3.5 Shark: