High-Speed In-Memory Analytics over Hadoop and Hive Data

Bernice Johnston

1 High-Speed In-Memory Analytics over Hadoop and Hive Data Big Data 2015

2 Apache Spark
Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
In-memory data storage for very fast iterative queries
General execution graphs and powerful optimizations
Up to 40x faster than Hadoop
Compatible with Hadoop's storage APIs
Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

3 Apache Spark
[Figure: iterative jobs (iter. 1, iter. 2, ...) and interactive queries (query 1, query 2, query 3, ...) run over an input that is loaded into distributed memory with one-time processing; 10-100x faster than network and disk]

4 Users

5 Shark
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
Similar speedups of up to 40x

6 Shark
[Figure: Shark architecture. Components shown: Client, CLI, JDBC, Metastore, SQL Parser, Driver, Query Optimizer, Cache Mgr., Physical Plan, Execution, Spark, HDFS]

7 Software Stack
[Figure: software stack. Components shown: Shark (Hive on Spark), Bagel (Pregel on Spark), Streaming Spark, ..., Spark, Local mode, EC2, Apache Mesos, YARN]

8 Spark Configuration
Download a binary release of Apache Spark: spark bin-hadoop2.6.tgz

9 Spark Configuration
In the conf directory under the Spark home directory, set up the spark-env.sh file (if needed)

10 Shark Configuration Shark has been subsumed by Spark SQL, a new module in Apache Spark:

11 Spark Running
Running Spark Shell [scala]:
    $:~spark-*/bin/spark-shell
Running Spark Shell [python]:
    $:~spark-*/bin/pyspark
Spark Shell - Scala:
    Welcome to [Spark ASCII-art banner] version
    Using Scala version (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_05)
    Type in expressions to have them evaluated.
    scala>

12 Spark Self-contained applications Java Spark API

    import org.apache.spark.api.java.*;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.Function;

    public class SimpleApp {
      public static void main(String[] args) {
        String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Load the file as an RDD of lines and keep it in memory for reuse
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        // Count the lines that contain the letter "a"
        long numAs = logData.filter(new Function<String, Boolean>() {
          public Boolean call(String s) { return s.contains("a"); }
        }).count();

        // Count the lines that contain the letter "b"
        long numBs = logData.filter(new Function<String, Boolean>() {
          public Boolean call(String s) { return s.contains("b"); }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
      }
    }

13 Spark Self-contained applications Java Spark API: configuration of Spark application

    String logFile = "YOUR_SPARK_HOME/README.md";
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache();

14 Spark Self-contained applications Java Spark API: Spark actions

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

15 Spark Self-contained applications SimpleApp.java
Creates logData: an object like [line1, line2, line3, ...]
Input:
    sopra la panca la capra campa, sotto la panca la capra crepa
Output:
    Lines with a: 1, lines with b: 0
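
The counting logic of SimpleApp can be checked without a Spark installation. The following is a minimal plain-Java sketch, not the Spark program itself: it assumes the data is an in-memory list instead of an RDD, and the class and method names (LineFilterSketch, countLinesContaining) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class LineFilterSketch {
    // Counts the lines that contain the given substring, mirroring
    // logData.filter(...).count() on an in-memory list (assumption: no Spark).
    static long countLinesContaining(List<String> lines, String needle) {
        return lines.stream().filter(line -> line.contains(needle)).count();
    }

    public static void main(String[] args) {
        // The single sample line used on the slide.
        List<String> logData = Arrays.asList(
                "sopra la panca la capra campa, sotto la panca la capra crepa");
        long numAs = countLinesContaining(logData, "a");
        long numBs = countLinesContaining(logData, "b");
        // Prints: Lines with a: 1, lines with b: 0
        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    }
}
```

The counts are per line, not per character: the one sample line contains an "a" and no "b", hence 1 and 0.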

16 Spark Self-contained applications pom.xml Maven Project

    <project>
      <modelVersion>4.0.0</modelVersion>
      <groupId>sparkproject</groupId>
      <artifactId>sparkproject</artifactId>
      <name>Simple Project</name>
      <packaging>jar</packaging>
      <version>1</version>
      <dependencies>
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.10</artifactId>
          <version>1.3.1</version>
        </dependency>
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-sql_2.10</artifactId>
          <version>1.3.1</version>
        </dependency>
      </dependencies>
    </project>

17 Spark Running - standalone
Running Java Spark applications:
    $:~spark-*/bin/spark-submit --class "SimpleApp" --master local[4] SparkProject-1.0.jar
Output [terminal]:
    Lines with a: 46, Lines with b: 23

18 Spark Running - standalone
Running Java Spark applications, exporting the output to a text file:
    $:~spark-*/bin/spark-submit --class "SimpleApp" --master local[4] SparkProject-1.0.jar > output.txt
output.txt:
    Lines with a: 46, Lines with b: 23

19 Exercises
Word counting
Filtering log files
Computing page rank of web sites
Computing Pi value
Computing transitive closure of a graph
Querying structured data via Shark (Spark SQL)
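
For the "Computing Pi value" exercise, one common approach is Monte Carlo sampling, which is also the idea behind the SparkPi example bundled with Spark. The sketch below is plain Java on a single machine, not a Spark job; the class name PiSketch, the method estimatePi, and the seed parameter are hypothetical and only there to make the run repeatable.

```java
import java.util.Random;

class PiSketch {
    // Estimates Pi by sampling points uniformly in the square [-1, 1] x [-1, 1]
    // and measuring the fraction that falls inside the unit circle.
    // That fraction approximates (circle area) / (square area) = Pi / 4.
    static double estimatePi(int samples, long seed) {
        Random random = new Random(seed);
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = random.nextDouble() * 2 - 1;
            double y = random.nextDouble() * 2 - 1;
            if (x * x + y * y <= 1) {
                inside++;
            }
        }
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        System.out.println("Pi is roughly " + estimatePi(1_000_000, 42L));
    }
}
```

In a Spark version, the loop over samples would typically be split across partitions and the per-partition hit counts summed with a reduce.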

20 Spark Self-contained applications JavaWordCount.java
Creates words: an object like [word1, word2, word3, ...]
    sopra la panca la capra campa, sotto la panca la capra crepa
Creates ones: an object like [(word1,1), (word2,1), (word3,1), ...]
    (sopra,1) (la,1) (panca,1) (la,1) (capra,1) (campa,,1) (sotto,1) (la,1) (panca,1) (la,1) (capra,1) (crepa,1)
Result:
    Counting: 1 1
    Counting: 2 1
    Counting: 1 1
    Counting: 3 1
    Counting: 1 1
    panca: 2
    la: 4
    campa,: 1
    sotto: 1
    crepa: 1
    sopra: 1
    capra: 2
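
The word-count steps above (split into words, pair each word with 1, sum the ones per word) can be sketched in plain Java to make the expected result easy to verify. This is an assumption-laden stand-in for the Spark job, not JavaWordCount.java itself: it uses an in-memory map instead of RDDs, splits on single spaces, and the names WordCountSketch and countWords are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WordCountSketch {
    // Splits the text on single spaces and sums a count of 1 per occurrence,
    // which is what the (word, 1) pairs followed by a per-key sum achieve.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : text.split(" ")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "sopra la panca la capra campa, sotto la panca la capra crepa";
        // Entries are printed in first-seen order, unlike the slide's output order.
        countWords(text).forEach((word, count) ->
                System.out.println(word + ": " + count));
    }
}
```

Because the split is on spaces only, the trailing comma stays attached to "campa,", which is why the slide lists "campa,: 1" as its own key.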

21 Spark Self-contained applications JavaSimpleApp2
For each line, count how many words contain the letter a and how many words contain the letter b
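
A possible shape for the per-line counting, sketched in plain Java rather than with the Spark API. The class name WordsPerLineSketch, the method wordsContaining, and the second sample line "big data labs" are hypothetical; in a Spark program the same function could be applied to each element of the lines RDD with a map.

```java
import java.util.Arrays;

class WordsPerLineSketch {
    // Counts the whitespace-separated words of one line that contain the letter.
    static long wordsContaining(String line, String letter) {
        return Arrays.stream(line.trim().split("\\s+"))
                .filter(word -> word.contains(letter))
                .count();
    }

    public static void main(String[] args) {
        String[] lines = {
            "sopra la panca la capra campa, sotto la panca la capra crepa",
            "big data labs" // hypothetical extra line, not from the slides
        };
        for (String line : lines) {
            System.out.println("a: " + wordsContaining(line, "a")
                    + ", b: " + wordsContaining(line, "b"));
        }
        // Prints:
        // a: 11, b: 0
        // a: 2, b: 2
    }
}
```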

22 Spark Self-contained applications JavaPageRank.java
Basic idea: give pages ranks (scores) based on links to them
Links from many pages → high rank
Link from a high-rank page → high rank
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

23 Spark Self-contained applications JavaPageRank.java
Algorithm:
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Figure: four-page link graph, initial ranks 1.0, 1.0, 1.0, 1.0]

24 Spark Self-contained applications JavaPageRank.java
Algorithm: steps 1-3 as on slide 23
[Figure: ranks 1.0, 1.0, 1.0, 1.0; contributions sent along the links: 1, 1, 0.5, 0.5, 0.5, 0.5]

25 Spark Self-contained applications JavaPageRank.java
Algorithm: steps 1-3 as on slide 23
[Figure: ranks after the update: 1.85, 0.58, 1.0, 0.58]

26 Spark Self-contained applications JavaPageRank.java
Algorithm: steps 1-3 as on slide 23
[Figure: ranks and the next round of contributions; extracted values: 0.58, 1.85, 0.58, 0.29, 0.29, 0.58, 0.5, 1.85, 0.5, 1.0]

27 Spark Self-contained applications JavaPageRank.java
Algorithm: steps 1-3 as on slide 23
[Figure: ranks after the next update: 1.31, 0.39, 1.72, 0.58, and so on (...)]

28 Spark Self-contained applications JavaPageRank.java
Algorithm: steps 1-3 as on slide 23
[Figure: final state: 1.44, 0.46, 1.37, 0.73]
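
The three algorithm steps can be traced with a short plain-Java sketch. It is not JavaPageRank.java and does not use Spark. The four-page link graph below is an assumption: the slides' figure did not survive extraction, so the links were chosen so that the computed ranks agree with the values shown on slides 25, 27 and 28. The names PageRankSketch and pageRank are hypothetical.

```java
import java.util.Arrays;

class PageRankSketch {
    static final double DAMPING = 0.85;

    // links[p] lists the pages that page p links to (its neighbors).
    static double[] pageRank(int[][] links, int iterations) {
        int n = links.length;
        double[] ranks = new double[n];
        Arrays.fill(ranks, 1.0); // step 1: start each page at a rank of 1

        for (int it = 0; it < iterations; it++) {
            double[] contribs = new double[n];
            for (int p = 0; p < n; p++) {
                for (int q : links[p]) {
                    // step 2: p contributes rank_p / |neighbors_p| to each neighbor
                    contribs[q] += ranks[p] / links[p].length;
                }
            }
            for (int p = 0; p < n; p++) {
                // step 3: new rank = 0.15 + 0.85 * contribs
                ranks[p] = (1 - DAMPING) + DAMPING * contribs[p];
            }
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Assumed graph: page 0 -> 2; page 1 -> 0; page 2 -> 0, 3; page 3 -> 0, 1
        int[][] links = { {2}, {0}, {0, 3}, {0, 1} };
        for (int iterations : new int[] {1, 2, 50}) {
            double[] ranks = pageRank(links, iterations);
            StringBuilder line = new StringBuilder("after " + iterations + ":");
            for (double r : ranks) {
                line.append(String.format(" %.3f", r));
            }
            System.out.println(line);
        }
    }
}
```

With this assumed graph, one iteration gives ranks close to 1.85, 0.58, 1.0, 0.58, two iterations give roughly 1.31, 0.39, 1.72, 0.58, and after many iterations the ranks settle near 1.44, 0.46, 1.37, 0.73. In the Spark version, step 2 is usually expressed as a join of the links with the current ranks followed by a per-page sum of contributions.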

29 Resources

30 High-Speed In-Memory Analytics over Hadoop and Hive Data Big Data 2015
