How To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5



Similar documents
How Companies are! Using Spark

Moving From Hadoop to Spark

Unified Big Data Processing with Apache Spark. Matei

Ali Ghodsi Head of PM and Engineering Databricks

Large scale processing using Hadoop. Ján Vaňo

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Architectures for massive data management

Hadoop Ecosystem B Y R A H I M A.

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Case Study : 3 different hadoop cluster deployments

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Beyond Hadoop with Apache Spark and BDAS

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Next-Gen Big Data Analytics using the Spark stack

Hadoop implementation of MapReduce computational model. Ján Vaňo

Workshop on Hadoop with Big Data

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

An Open Source Memory-Centric Distributed Storage System

Big Data Analytics Hadoop and Spark

Spark: Making Big Data Interactive & Real-Time

Analytics on Spark &

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Big Data Analytics. Lucas Rego Drumond

How To Scale Out Of A Nosql Database

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

HDP Hadoop From concept to deployment.

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Unified Big Data Analytics Pipeline. 连 城

Dell In-Memory Appliance for Cloudera Enterprise

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

HADOOP. Revised 10/19/2015

CSE-E5430 Scalable Cloud Computing Lecture 11

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

Hadoop & Spark Using Amazon EMR

TECHNOLOGY TRANSFER PRESENTS INTERNATIONAL. Rome, December Residenza di Ripetta Via di Ripetta, 231 CONFERENCE BIG DATA

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

Hadoop IST 734 SS CHUNG

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Conquering Big Data with BDAS (Berkeley Data Analytics)

Real Time Data Processing using Spark Streaming

Apache Flink Next-gen data analysis. Kostas

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Brave New World: Hadoop vs. Spark

HiBench Introduction. Carson Wang Software & Services Group

Introduction to Big Data Training

The Internet of Things and Big Data: Intro

The Future of Data Management

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us

Hadoop. Sunday, November 25, 12

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Big Data. Value, use cases and architectures. Petar Torre Lead Architect Service Provider Group. Dubrovnik, Croatia, South East Europe May, 2013

HDP Enabling the Modern Data Architecture

Apache Zeppelin, the missing component for your BigData ecosystem

Big Data. Lyle Ungar, University of Pennsylvania

Implement Hadoop jobs to extract business value from large and varied data sets

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Next Gen Hadoop Gather around the campfire and I will tell you a good YARN

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

Spark and the Big Data Library

Application Development. A Paradigm Shift

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Dominik Wagenknecht Accenture

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

GridGain In- Memory Data Fabric: UlCmate Speed and Scale for TransacCons and AnalyCcs

Write Once, Run Anywhere Pat McDonough

Upcoming Announcements

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Big Data Research in the AMPLab: BDAS and Beyond

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Shark Installation Guide Week 3 Report. Ankush Arora

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Certified Big Data and Apache Hadoop Developer VS-1221

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Business Intelligence for Big Data

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Oracle Big Data SQL Technical Update

Big Data Course Highlights

Hadoop Big Data for Processing Data and Performing Workload

Cost-Effective Business Intelligence with Red Hat and Open Source

Peers Techno log ies Pv t. L td. HADOOP

Talend Real-Time Big Data Sandbox. Big Data Insights Cookbook

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

NextGen Infrastructure for Big DATA Analytics.

Transcription:

Big Data Visualization using Apache Spark and Zeppelin Prajod Vettiyattil, Software Architect, Wipro

Agenda Big Data and Ecosystem tools Apache Spark Apache Zeppelin Data Visualization Combining Spark and Zeppelin 2

BIG DATA AND ECOSYSTEM TOOLS

Big Data Data size beyond systems capability Terabyte, Petabyte, Exabyte Storage Commodity servers, RAID, SAN Processing In reasonable response time A challenge here 4

Tradition processing tools Move what? the data to the code or the code to the data move data to code Data move code to data Code Server Server 5

Traditional processing tools Traditional tools RDBMS, DWH, BI High cost Difficult to scale beyond certain data size price/performance skew data variety not supported 6

Map-Reduce and NoSQL Hadoop toolset Free and open source Commodity hardware Scales to exabytes(10 18 ), maybe even more Not only SQL Storage and query processing only Complements Hadoop toolset Volume, velocity and variety 7

All is well? Hadoop was designed for batch processing Disk based processing: slow Many tools to enhance Hadoop s capabilities Distributed cache, Haloop, Hive, HBase Not for interactive and iterative 8

TOWARDS SINGULARITY

AI capacity What is singularity? 8000 Decade vs AI capacity 7000 6000 5000 4000 3000 2000 1000 0 1 2 3 4 5 6 7 Decade Point of singularity 10

Technological singularity When AI capability exceeds Human capacity AI or non-ai singularity 2045: http://en.wikipedia.org/wiki/ray_kurzweil The predicted year 11

APACHE SPARK

History of Spark 2009 2010 2013 2014 2015 March Spark created by PhD student at UC Berkeley, Matei Zaharia Spark is made open source. Available on github Spark donated to Apache Software Foundation Spark 1.0.0 released. 100TB sort achieved in 23 mins Spark 1.3.0 released 13

Contributors in Spark Yahoo Intel UC Berkeley 50+ organizations 14

Hadoop and Spark Spark complements the Hadoop ecosystem Replaces: Hadoop MR Spark integrates with HDFS Hive HBase YARN 15

Other big data tools Spark also integrates with Kafka ZeroMQ Cassandra Mesos 16

Programming Spark Java Scala Python R 17

Spark toolset Spark Cassandra Spark SQL Spark Streamin g MLlib GraphX Spark R Blink DB Apache Spark Tachyon 18

What is Spark for? Batch Interactive Streaming 19

The main difference: speed RAM access vs Disk access RAM access is 100,000 times faster! https://gist.github.com/hellerbarde/2843375 20

Lambda Architecture pattern Used for Lambda architecture implementation Batch layer Speed layer Serving layer Speed Layer Data Input Batch Layer Data consumers Serving Layer 21

Deployment Architecture Spark Driver Spark s Cluster Manager HDFS Name Node Master Node Task Task Task Executor Executor Cache Worker Node Cache Executor Worker Node HDFS Data Node HDFS Data Node 22

APACHE ZEPPELIN

Interactive data analytics For Spark and Flink Web front end At the back, it connects to SQL systems(eg: Hive) Spark Flink 24

Deployment Architecture Web browser 1 Web Server Optional Spark / Flink / Hive Web browser 2 Web browser 3 Local Interpreters Zeppelin daemon Remote Interpreters 25

Notebook Is where you do your data analysis Web UI REPL with pluggable interpreters Interpreters Scala, Python, Angular, SparkSQL, Markdown and Shell 26

User Interface features Markdown Dynamic HTML generation Dynamic chart generation Screen sharing via websockets 27

SQL Interpreter SQL shell Query spark data using SQL queries Return normal text, HTML or chart type results 28

Scala interpreter for Spark Similar to the Spark shell Upload your data into Spark Query the data sets(rdds) in your Spark server Execute map-reduce tasks Actions on RDD Transformations on RDD 29

DATA VISUALIZATION

Visualization tools Source: http://selection.datavisualization.ch/ 31

D3 Visualizations Source: https://github.com/mbostock/d3/wiki/gallery 32

The need for visualization Big Data Do something to data User gets comprehensible data 33

Tools for Data Presentation Architecture A data analysis tool/toolset would support: 5.Present 4.Format 3.Manipulate 2.Locate 1.Identify 34

COMBINING SPARK AND ZEPPELIN

Spark and Zeppelin Web browser 1 Web Server Spark Worker Node Web browser 2 Spark Master Node Web browser 3 Local Interpreters Zeppelin daemon Remote Interpreters Spark Worker Node 36

Zeppelin views: Table from SQL 37

Zeppelin views: Table from SQL %sql select age, count(1) from bank where marital="${marital=single,single divorced married}" group by age order by age 38

Zeppelin views: Pie chart from SQL 39

Zeppelin views: Bar chart from SQL 40

Zeppelin views: Angular 41

Share variables: MVVM Between Scala/Python/Spark and Angular Observe scala variables from angular Zeppelin x = foo Scala-Spark x = bar Angular 42

Screen sharing using Zeppelin Share your graphical reports Live sharing Get the share URL from zeppelin and share with others Uses websockets Embed live reports in web pages 43

FUTURE

Spark and Zeppelin Spark Machine Learning using Spark GraphX and MLlib Zeppelin Additional interpreters Better graphics Report persistence More report templates Better angular integration 45

SUMMARY

Summary Spark and tools The need for visualization The role of Zeppelin Zeppelin Spark integration 47