In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet




In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet
Ema Iancuta iorhian@gmail.com
Radu Chilom radu.chilom@gmail.com
Buzzwords Berlin 2015

Big data analytics / machine learning. 6+ years with the Hadoop ecosystem, 2 years with Spark. http://atigeo.com/
A research group that focuses on the technical problems that exist in the big data industry and provides open source solutions. http://bigdataresearch.io/

Agenda
- Intro
- Use case
- Data pipeline with Spark
- Spark-Job-Rest service
- Spark SQL REST service (Jaws)
- Parquet
- Tachyon
- Demo

Use case
Build an in-memory data pipeline for millions of financial transactions, used downstream by data scientists for detecting fraud:
- Ingestion from S3 to our Tachyon/HDFS cluster
- Data transformation
- Data warehouse

Apache Spark
- Fast and general engine for large-scale data processing
- Built around the concept of the RDD (Resilient Distributed Dataset)
- API for Java/Scala/Python (80 operators)
- Powers a stack of high-level tools including Spark SQL, MLlib and Spark Streaming

Public S3 bucket: public-financial-transactions
public-financial-transactions (S3 bucket)
├── scheme/
│   └── scheme.csv
├── data/
│   ├── input-0.csv
│   ├── input-1.csv
│   └── ...
└── data2/
    └── ...

1. Ingestion
- Download from S3
- Resolving the wildcards means listing file metadata
- Listing the metadata for a large number of files from external sources can take a long time

Listing the metadata (distributed)
[Diagram: the driver distributes folder1..folder6 across three workers; each worker lists the files in its assigned folders (e.g. folder1 → file-11, file-12; folder4 → file-41..file-44).]

Listing the metadata (distributed) For fine tuning, specify the number of partitions
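The distributed-listing idea above can be sketched in plain Python. This is a thread-pool simulation with made-up folder and file names, not the deck's Spark code; in the real pipeline each worker would call the S3 list API instead of reading a local dict:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical bucket contents, standing in for the S3 metadata.
FAKE_BUCKET = {
    "folder1": ["file-11", "file-12"],
    "folder2": ["file-21", "file-22", "file-23"],
    "folder3": ["file-31", "file-32"],
    "folder4": ["file-41", "file-42", "file-43", "file-44"],
    "folder5": ["file-51", "file-52"],
    "folder6": ["file-61"],
}

def list_folder(folder):
    # In the real pipeline a Spark worker would issue the S3 list call here.
    return [f"{folder}/{name}" for name in FAKE_BUCKET[folder]]

def distributed_listing(folders, parallelism=3):
    # The driver only ships folder names; the expensive per-folder
    # metadata listing runs in parallel on the (simulated) workers.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        listings = pool.map(list_folder, folders)
    return [path for listing in listings for path in listing]

files = distributed_listing(sorted(FAKE_BUCKET))
```

Only the folder names travel from the driver; the slow part, one list call per folder, is what gets parallelized.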

Download files: unbalanced partitions

Unbalanced partitions
Partition 0: transactions.csv
Partition 1: input.csv, data.csv, values.csv, buzzwords.csv, buzzwords.txt

Balancing partitions
Partition 0: (0, transactions.csv), (2, data.csv), (4, buzzwords.csv)
Partition 1: (1, input.csv), (3, values.csv), (5, buzzwords.txt)

Balancing partitions
Keep in mind that repartitioning your data is a fairly expensive operation.
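The (index, file) pairs on the previous slide suggest a round-robin scheme; a plain-Python sketch (using the file names from the slide, simulating what zipWithIndex plus a modulo partitioner would do in Spark) makes it concrete:

```python
def balance(files, num_partitions):
    # Round-robin by global index: file i goes to partition i % num_partitions,
    # so each partition gets roughly the same number of files.
    partitions = [[] for _ in range(num_partitions)]
    for i, name in enumerate(files):
        partitions[i % num_partitions].append((i, name))
    return partitions

files = ["transactions.csv", "input.csv", "data.csv",
         "values.csv", "buzzwords.csv", "buzzwords.txt"]
parts = balance(files, 2)
```

This reproduces the split on the slide: partition 0 gets the even indices, partition 1 the odd ones.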

2. Data Transformation Data cleaning is the first step in any data science project For this use-case: - Remove lines that don't match the structure - Remove useless columns - Transform data to be in a consistent format
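A minimal sketch of those three cleaning steps, on made-up CSV data (the real record structure is not shown in the deck):

```python
import csv
import io

# Hypothetical raw input: a header, a well-formed row, a broken line,
# and another well-formed row with inconsistent currency casing.
RAW = """id,amount,currency,date,junk
1,100.50,eur,2015-05-01,x
broken line without enough fields
2,20,USD,2015-05-02,y
"""

def clean(raw):
    reader = csv.reader(io.StringIO(raw))
    header = next(reader)
    keep = [header.index(c) for c in ("id", "amount", "currency", "date")]
    rows = []
    for rec in reader:
        if len(rec) != len(header):        # remove lines that don't match the structure
            continue
        rec = [rec[i] for i in keep]       # remove useless columns (drops "junk")
        rec[2] = rec[2].upper()            # transform to a consistent format
        rows.append(rec)
    return rows

cleaned = clean(RAW)
```

In the pipeline the same logic would run per line inside a Spark transformation rather than over an in-memory string.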

Find country char code
Map the numeric format (e.g. 276) to the alpha-2 format (DE) and the country name (Germany) via a join.
Problem: skew in the key distribution.

Metrics for the join

Find country char code
Instead of joining, broadcast the country-codes map to every worker.

Metrics

Transformation with join vs. broadcasted map (skewed key)
[Chart: runtime in seconds (0-300) at 1, 2 and 3 million rows, comparing the join against the broadcasted map.]
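With the small table broadcast, the skewed join collapses into a local O(1) lookup per row. A pure-Python sketch (hypothetical transaction rows; the country-code entries follow the 276 → DE / Germany example from the slides):

```python
# Small dimension table; in Spark this would be sent to every worker
# via sc.broadcast instead of being shuffled as a join side.
COUNTRY_CODES = {"276": ("DE", "Germany"), "250": ("FR", "France")}

def enrich(transactions, codes):
    # One dictionary lookup per row; no shuffle, so key skew is harmless.
    out = []
    for tx_id, numeric in transactions:
        alpha2, name = codes.get(numeric, ("??", "unknown"))
        out.append((tx_id, alpha2, name))
    return out

rows = enrich([(1, "276"), (2, "276"), (3, "250")], COUNTRY_CODES)
```

Because every worker holds the whole map, a hot key like 276 costs no more than any other key, which is exactly what the chart above is measuring.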

Spark-Job-Rest https://github.com/atigeo/spark-job-rest
- Supports multiple contexts
- Launches a new process for each Spark context
- Inter-process communication with Akka actors
- Easy context creation & job runs
- Supports Java and Scala code
- Friendly UI

Build a data warehouse
Options: Hive, Apache Pig, Impala, Presto, Stinger (Hive on Tez), Spark SQL

Spark SQL
- HiveQL / SQL
- Rich language interfaces
- RDD-aware optimizer
- DataFrame / SchemaRDD
- RDD
- JDBC
- Support for multiple input formats

Creating a data frame

Explore data
Perform a simple query:
> Directly on the data frame: select, filter, join, groupBy, agg, count, sort, where, etc.
> By registering a temporary table
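A plain-Python analogue of chaining filter / groupBy / agg, on made-up rows (not Spark SQL code), shows the shape of such a query:

```python
from collections import defaultdict

# Hypothetical cleaned transactions.
transactions = [
    {"country": "DE", "amount": 100.0},
    {"country": "DE", "amount": 50.0},
    {"country": "FR", "amount": 70.0},
]

# Equivalent in spirit to:
#   df.filter(df["amount"] >= 60.0).groupBy("country").agg(sum("amount"))
totals = defaultdict(float)
for row in transactions:
    if row["amount"] >= 60.0:                    # filter
        totals[row["country"]] += row["amount"]  # groupBy + agg(sum)
```

In Spark SQL the same query runs through the RDD-aware optimizer and executes distributed, but the relational operations compose the same way.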

Creating a data warehouse https://github.com/atigeo/xpatterns-spark-parquet

File formats
TextFile, SequenceFile, RCFile (Row Columnar), ORCFile (Optimized Row Columnar), Avro, Parquet
Parquet:
> columnar format
> good for aggregation queries
> only the required columns are read from disk
> nested data structures
> schema stored with the data
> Spark SQL supports schema evolution
> efficient compression
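Why the columnar layout helps aggregation can be shown with a toy in-memory model (illustrative data only, not Parquet's actual encoding):

```python
# Row layout: one record per entry; a query over "amount" still has to
# touch every whole record.
rows = [("tx1", "DE", 100.0), ("tx2", "FR", 70.0), ("tx3", "DE", 50.0)]

# Columnar layout: one contiguous list per column, as Parquet stores it.
columns = {
    "id":      ["tx1", "tx2", "tx3"],
    "country": ["DE", "FR", "DE"],
    "amount":  [100.0, 70.0, 50.0],
}

# An aggregation query reads only the "amount" column; with a columnar
# file on disk, the "id" and "country" bytes are never read at all.
total = sum(columns["amount"])
```

Keeping values of one column together also makes them compress well, since similar values sit next to each other.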

Tachyon
- Memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks
- Pluggable under-layer file system: HDFS, S3, ...

Caching in Spark SQL
- Caches data in a columnar format
- Automatically tunes compression

Spark cache vs. Tachyon
- With the Spark cache: the Spark context might crash; GC kicks in
- With Tachyon: share data between different applications

Jaws (Spark SQL REST)
- Highly scalable and resilient data warehouse
- Submit queries concurrently and asynchronously
- RESTful alternative to Spark SQL JDBC, with an interactive UI
- Around since Spark 0.9.1 (with Shark)
- Support for Spark SQL and Hive on MR (and more to come)
https://github.com/atigeo/jaws-spark-sql-rest

Jaws main features
- Akka actors to communicate between instances
- Supports cancelling queries
- Supports large result retrieval
- Parquet in-memory warehouse
- Returns persisted logs, results and query history
- Provides a metadata browser
- Configuration file to fine-tune Spark

Code available at https://github.com/big-data-research/in-memory-data-pipeline

Q & A

2013 Atigeo, LLC. All rights reserved. Atigeo and the xpatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.