How To Create A Data Visualization With Apache Spark And Zeppelin

1 Big Data Visualization using Apache Spark and Zeppelin. Prajod Vettiyattil, Software Architect, Wipro

2 Agenda: Big Data and ecosystem tools; Apache Spark; Apache Zeppelin; data visualization; combining Spark and Zeppelin

3 BIG DATA AND ECOSYSTEM TOOLS

4 Big Data: data size beyond a system's capability (terabytes, petabytes, exabytes). Storage: commodity servers, RAID, SAN. Processing: in reasonable response time, which is the challenge here.

5 Traditional processing tools: move what? Either move the data to the code, or move the code to the data. [Diagram: "move data to code" versus "move code to data", each showing data, code and a server.]

6 Traditional processing tools: RDBMS, DWH, BI. High cost. Difficult to scale beyond a certain data size because of price/performance skew. Data variety not supported.

7 Map-Reduce and NoSQL. Hadoop toolset: free and open source; runs on commodity hardware; scales to exabytes (10^18 bytes), maybe even more. NoSQL ("Not only SQL"): storage and query processing only; complements the Hadoop toolset. Volume, velocity and variety.

8 All is well? Hadoop was designed for batch processing. Disk-based processing: slow. Many tools enhance Hadoop's capabilities: distributed cache, HaLoop, Hive, HBase. Not for interactive and iterative processing.

9 TOWARDS SINGULARITY

10 What is singularity? [Chart: AI capacity plotted against decade, with the point of singularity marked.]

11 Technological singularity: when AI capability exceeds human capacity. AI or non-AI singularity. 2045: the predicted year.

12 APACHE SPARK

13 History of Spark [timeline; of the date labels only "March" survives]: Spark created by Matei Zaharia, a PhD student at UC Berkeley. Spark is made open source and available on GitHub. Spark donated to the Apache Software Foundation. Spark released. 100 TB sort achieved in 23 minutes. Spark released.

14 Contributors in Spark: Yahoo, Intel, UC Berkeley, 50+ organizations

15 Hadoop and Spark: Spark complements the Hadoop ecosystem. Replaces: Hadoop MR. Spark integrates with HDFS, Hive, HBase and YARN.

16 Other big data tools: Spark also integrates with Kafka, ZeroMQ, Cassandra and Mesos.

17 Programming Spark: Java, Scala, Python, R

18 Spark toolset [diagram]: Apache Spark, Spark SQL, Spark Streaming, MLlib, GraphX, SparkR, BlinkDB, Tachyon, Spark Cassandra

19 What is Spark for? Batch, interactive, streaming

20 The main difference: speed. RAM access vs disk access: RAM access is 100,000 times faster!
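One way to see where a factor like 100,000 can come from is simple arithmetic on access latencies. The figures below are illustrative order-of-magnitude assumptions (roughly 100 ns for a RAM access and 10 ms for a seek on a spinning disk), not numbers taken from the slides; real hardware varies widely.

```python
# Assumed, order-of-magnitude latencies (not from the slides).
RAM_ACCESS_NS = 100            # about 100 nanoseconds per RAM access
DISK_SEEK_NS = 10_000_000      # about 10 milliseconds per disk seek

def speedup(slow_ns, fast_ns):
    """Return how many times faster the fast access is than the slow one."""
    return slow_ns // fast_ns

ram_vs_disk = speedup(DISK_SEEK_NS, RAM_ACCESS_NS)
print(f"RAM is about {ram_vs_disk:,} times faster under these assumptions")
```

With different assumptions (for example an SSD instead of a spinning disk) the ratio shrinks considerably, so the factor is best read as a rule of thumb.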

21 Lambda Architecture pattern: Spark is used for Lambda architecture implementation. Layers: batch layer, speed layer, serving layer. [Diagram: data input, batch layer, speed layer, serving layer, data consumers.]
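The three layers can be sketched in a few lines of plain Python. This is a minimal illustration of the pattern, not Spark code: the page-view events and the function names `batch_view`, `speed_view` and `serve` are hypothetical, chosen only for this example.

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch layer: recompute counts from the full master dataset."""
    return Counter(event["page"] for event in master_dataset)

def speed_view(recent_events):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(event["page"] for event in recent_events)

def serve(batch, speed, page):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch[page] + speed[page]

# Hypothetical sample data.
master = [{"page": "home"}, {"page": "home"}, {"page": "about"}]
recent = [{"page": "home"}]

batch = batch_view(master)
speed = speed_view(recent)
home_views = serve(batch, speed, "home")
about_views = serve(batch, speed, "about")
print(home_views, about_views)
```

In a Spark deployment the batch view would typically come from a batch job and the speed view from a streaming job; the merge in the serving layer stays conceptually the same.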

22 Deployment Architecture [diagram: a master node with the Spark driver, Spark's cluster manager and the HDFS name node; worker nodes with executors running tasks, caches and HDFS data nodes.]

23 APACHE ZEPPELIN

24 Interactive data analytics: for Spark and Flink. Web front end. At the back, it connects to SQL systems (e.g. Hive), Spark and Flink.

25 Deployment Architecture [diagram: web browsers 1, 2 and 3; web server; Zeppelin daemon with local interpreters and remote interpreters; optional Spark / Flink / Hive.]

26 Notebook: where you do your data analysis. Web UI REPL with pluggable interpreters. Interpreters: Scala, Python, Angular, SparkSQL, Markdown and Shell.

27 User interface features: Markdown, dynamic HTML generation, dynamic chart generation, screen sharing via websockets.

28 SQL interpreter: SQL shell. Query Spark data using SQL queries. Returns normal text, HTML or chart type results.

29 Scala interpreter for Spark: similar to the Spark shell. Upload your data into Spark. Query the data sets (RDDs) in your Spark server. Execute map-reduce tasks: actions on RDDs, transformations on RDDs.
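The map-reduce idea behind these tasks can be sketched without a cluster. The word count below is plain Python, not the Spark API, and the phase function names are hypothetical. The map phase is a lazy generator, loosely mirroring how RDD transformations are deferred, and nothing is computed until the reduce phase consumes it, loosely mirroring an action. On a real RDD the same job is usually written with `flatMap`, `map` and `reduceByKey`.

```python
from collections import defaultdict
from functools import reduce

def map_phase(lines):
    """Map: lazily emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: fold each key's list of values into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

# Hypothetical input lines.
lines = ["spark and zeppelin", "spark on hadoop"]
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(word_counts)
```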

30 DATA VISUALIZATION

31 Visualization tools. Source:

32 D3 Visualizations. Source:

33 The need for visualization: Big Data, then do something to the data, then the user gets comprehensible data.

34 Tools for Data Presentation Architecture: a data analysis tool/toolset would support: 1. Identify, 2. Locate, 3. Manipulate, 4. Format, 5. Present.

35 COMBINING SPARK AND ZEPPELIN

36 Spark and Zeppelin [diagram: web browsers 1, 2 and 3; web server; Zeppelin daemon with local interpreters and remote interpreters; Spark master node; Spark worker nodes.]

37 Zeppelin views: Table from SQL

38 Zeppelin views: Table from SQL. %sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age
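To try the shape of this query outside Zeppelin, the sketch below runs it against an in-memory SQLite table. This is an assumption-laden stand-in: the real `bank` table lives in Spark, the sample rows are invented, and the `${marital=...}` dynamic form (which Zeppelin renders as a select box) is replaced by a bound SQL parameter.

```python
import sqlite3

# In-memory stand-in for the Spark "bank" table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bank (age INTEGER, marital TEXT)")
conn.executemany(
    "INSERT INTO bank (age, marital) VALUES (?, ?)",
    [(30, "single"), (30, "single"), (41, "married"),
     (25, "single"), (41, "divorced")],
)

def age_histogram(marital):
    """Count rows per age for one marital status, ordered by age."""
    cursor = conn.execute(
        "SELECT age, COUNT(1) FROM bank "
        "WHERE marital = ? GROUP BY age ORDER BY age",
        (marital,),
    )
    return cursor.fetchall()

single_rows = age_histogram("single")
married_rows = age_histogram("married")
print(single_rows)
```

Each returned row is an (age, count) pair, which is the two-column shape Zeppelin turns into a table, pie chart or bar chart.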

39 Zeppelin views: Pie chart from SQL

40 Zeppelin views: Bar chart from SQL

41 Zeppelin views: Angular

42 Share variables: MVVM. Between Scala/Python/Spark and Angular. Observe Scala variables from Angular. [Diagram: Zeppelin, x = foo, Scala-Spark, x = bar, Angular.]

43 Screen sharing using Zeppelin: share your graphical reports. Live sharing: get the share URL from Zeppelin and share it with others. Uses websockets. Embed live reports in web pages.

44 FUTURE

45 Spark and Zeppelin. Spark: machine learning using Spark GraphX and MLlib. Zeppelin: additional interpreters, better graphics, report persistence, more report templates, better Angular integration.

46 SUMMARY

47 Summary: Spark and tools; the need for visualization; the role of Zeppelin; Zeppelin Spark integration
