Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us


DATA INTELLIGENCE FOR ALL
Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us
Christopher Nguyen, PhD, Co-Founder & CEO

Agenda
1. Challenges & Motivation
2. DDF Overview
3. DDF Design & Architecture
4. Demo

Christopher Nguyen, PhD, Adatao Inc. Co-Founder & CEO
- PhD Stanford; BS U.C. Berkeley, Summa cum Laude
- Former Engineering Director of Google Apps (Google Founders' Award)
- Former Professor and Co-Founder of the Computer Engineering program at HKUST
- Extensive experience building technology companies that solve enterprise challenges

How Have We Defined Big Data?
Old definition: Big Data has problems: huge volume, high velocity, great variety.
New definition: BIG DATA + BIG COMPUTE = Opportunitie$: (machine) learn from the data.

Current Big Data World (stack diagram)
- Presentation: web-based notebooks, iPython / R-Studio
- Batch query & compute: Hive SQL, Pig, Giraph, Cascading, Scalding, Java MapReduce API
- Low-latency + real-time query & compute: Shark SQL, SparkR, PySpark, SummingBird, Trident, Storm Java API
- Engine: Spark (Scala + Java API) with RDD, DataFrame, DStream, GraphX; bindings for R, Python, Scala + Java
- Data: HDFS / YARN

Maintenance is too painful!

Hive:
  CREATE EXTERNAL TABLE page_view(
    viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User',
    country STRING COMMENT 'country of origination')
  COMMENT 'This is the staging page view table'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
  STORED AS TEXTFILE
  LOCATION '<hdfs_location>';

  INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
  WHERE pvs.country = 'US';

Java MapReduce:
  public static DistributedRowMatrix runJob(Path markovPath, Vector diag,
      Path outputPath, Path tmpPath)
      throws IOException, ClassNotFoundException, InterruptedException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(markovPath.toUri(), conf);
    markovPath = fs.makeQualified(markovPath);
    outputPath = fs.makeQualified(outputPath);
    Path vectorOutputPath = new Path(outputPath.getParent(), "vector");
    VectorCache.save(new IntWritable(Keys.DIAGONAL_CACHE_INDEX), diag, vectorOutputPath, conf);
    // set up the job itself
    Job job = new Job(conf, "VectorMatrixMultiplication");
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    // ...

R workflow:
  - sample and copy data to local
  - lrfit <- glm(using ~ age + education)
  - export model to XML
  - deploy model using Java

I have a dream

In the Ideal World
  data = load houseprice
  data dropna
  data transform (duration = now - begin)
  data train glm(price, bedrooms)
It just works!!!

Make the Big-Data API simple & accessible
(word-cloud slide: today's jumble of Java, Python, MapReduce, HBase, PySpark, SparkR, RHadoop, Hive, monads, monoids... OR one simple API)

Design Principles

DDF is Simple
  DDFManager manager = DDFManager.get("spark");
  DDF ddf = manager.sql2ddf("select * from airline");

It's like a table
  ddf.Views.project("origin", "arrdelay");
  ddf.groupBy("dayofweek", "avg(arrdelay)");
  ddf.join(otherDDF);

It's just nice like this
  ddf.dropNA();
  ddf.getNumRows();
  ddf.getSummary();

Focus on analytics, not MapReduce
Simple, high-level, data-science-oriented APIs, powered by Spark RDD.

Raw Spark:
  val fileRDD = spark.textFile("hdfs://...")
  val counts = fileRDD
    .flatMap(line => line.split(" "))
    .map(arrdelay_byday => (arrdelay_byday.split(",")(0), arrdelay_byday.split(",")(1)))
    .reduceByKey(_ + _)
  counts.saveAsTextFile("hdfs://...")

DDF:
  ddf = manager.load("spark://ddf-runtime/ddf-db/airline")
  ddf.aggregate("dayofweek, sum(arrdelay)")
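To make the contrast concrete, here is a minimal sketch of what that one-line aggregate computes, expressed on plain Scala collections rather than RDDs; the sample rows and values are made up, standing in for (dayofweek, arrdelay) pairs from the airline data.

```scala
// Hypothetical sample rows: (dayofweek, arrdelay).
val rows = Seq(("Mon", 5.0), ("Tue", 2.0), ("Mon", 3.0), ("Tue", -1.0))

// The same group-and-sum the hand-written RDD pipeline spells out:
// group rows by key, then sum the values within each group.
val sumByDay: Map[String, Double] =
  rows.groupBy(_._1).map { case (day, rs) => day -> rs.map(_._2).sum }
```

On a cluster the grouping and summing are distributed across partitions, but the semantics are exactly this local group-and-sum.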

Quick access to a rich set of familiar ML idioms
Data wrangling:
  ddf.setMutable(true)
  ddf.dropNA()
  ddf.transform("speed = distance/duration")
Model building & validation:
  ddf.lm(0.1, 10)
  ddf.roc(testDDF)
Model deployment:
  manager.loadModel(lm)
  lm.predict(point)
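A transform like "speed = distance/duration" adds a derived column row by row. The following is a hedged sketch of that idea on plain Scala data; the Trip record and field names are hypothetical, not part of the DDF API.

```scala
// Hypothetical trip records; "transform" appends a computed column,
// much like ddf.transform("speed = distance/duration") does on a DDF.
case class Trip(distance: Double, duration: Double)
case class TripWithSpeed(distance: Double, duration: Double, speed: Double)

def addSpeed(trips: Seq[Trip]): Seq[TripWithSpeed] =
  trips.map(t => TripWithSpeed(t.distance, t.duration, t.distance / t.duration))
```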

Seamless Integration with MLlib
  // plug in an algorithm
  linearRegressionWithSGD = org.apache.spark.mllib.regression.LinearRegressionWithSGD
  // run the algorithm
  ddf.ML.train("linearRegressionWithSGD", 10, 0.1, 0.1)
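For intuition about what a train("linearRegressionWithSGD", iterations, stepSize, ...) call does underneath, here is a minimal pure-Scala sketch of gradient-descent linear regression with one feature and no intercept; the training data and parameter values are invented for illustration, and MLlib's actual implementation (mini-batch sampling, feature vectors, intercepts) is richer.

```scala
// Minimal gradient-descent fit of y ≈ w * x, minimizing mean squared error.
def trainLinear(xs: Seq[Double], ys: Seq[Double],
                iterations: Int, step: Double): Double = {
  var w = 0.0
  for (_ <- 1 to iterations) {
    // gradient of mean squared error with respect to w
    val grad = xs.zip(ys).map { case (x, y) => -2.0 * x * (y - w * x) }.sum / xs.size
    w -= step * grad
  }
  w
}
```

Trained on data drawn from y = 2x, the weight converges to roughly 2.0 within a hundred iterations at a small step size.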

Easily Collaborate with Others
"Can I see your data?" DDF://com.adatao/airline

DDF on Multiple Languages

DDF Architecture (diagram)
A Data Scientist/Engineer owns a DDF Manager (with a Config Handler) and has access to DDFs; each DDF carries ETL, Statistics, Representation, and ML handlers.

Demo
Cluster configuration: 8 nodes x 8 cores x 30 GB RAM
Data size: 12 GB / 120 million rows

DDF offers:
- Native R data.frame experience: a table-like abstraction on top of Big Data
- Focus on analytics, not MapReduce: simple, data-science-oriented APIs, powered by Spark
- Easily test & deploy new components: pluggable components by design
- Collaborate seamlessly & efficiently: mutable & sharable DDFs
- Work with APIs in your preferred language: multi-language support (Java, Scala, R, Python)

Example (diagram): a Business Analyst (web browser, PI Client), a Data Scientist (R-Studio, PA Client), and a Data Engineer (Python, DDF Client) each go through an API to the same data on HDFS.

To learn more about Adatao & DDF, contact us: www.adatao.com