Enterprise Data Storage and Analysis on Tim Barr

Enterprise Data Storage and Analysis on Tim Barr January 15, 2015

Agenda
- Challenges in Big Data Analytics
- Why many Hadoop deployments under deliver
- What is Apache Spark
- Spark Core, SQL, Streaming, MLlib, and GraphX
- Graphs for CyberAnalytics
- Hybrid Spark Architecture
- Why you should love Scala
- Q&A

Challenges in Big Data Analytics

Emergence of Latency-Sensitive Analytics
[Figure: analytics workloads arranged by response-time requirement]
- <30ms (low-latency): streaming data, tweets, event logs, IoT
- 30ms to 10min: SQL/ad hoc queries, BI, visualization, exploration
- >10min (batch): summarization, aggregation, indexing, ETL
Hadoop today covers the batch end; performance optimizations move it toward the middle tier tomorrow. Higher performance and more innovative use of memory/storage hierarchies and interconnects are required at the low-latency end.

Focus on Analytic Productivity
Time to Value is the key performance metric.
- Stand up big data clusters: sizing, provisioning, configuration, tuning, workload management; move into production
- Move data: copy, load, replication; multiple data sources; fighting data gravity
- Data prep (map, shuffle, reduce): cleansing, merging data, applying schema
- Analyze: multiple frameworks, analytics pipeline
- Apply results: scoring, reports, apply to next stage
Job run time is a fraction of the total Time to Value.

Integrated Advanced Analytics
- Data prep: basic profiling, statistics
- Machine learning, streaming
- Batch analytics: every record in a dataset once
- Iterative analytics: same subset of records several times
- Interactive queries: different subsets each time
In the real world, advanced analytics needs multiple, integrated toolsets, and these toolsets require very different computing demands.

Why Many Hadoop Deployments Under Deliver
- Data scientists are critical, but in short supply
- Shortage of big data tools
- Complexity of the MapReduce programming environment
- Cost of analytic value is currently too high
- MapReduce performance does not allow the analyst to follow his or her nose
- Spark is often installed on existing, underpowered Hadoop clusters, leading to undesirable performance

Hadoop: Great Promise but with Challenges
Forbes article, "How to Avoid a Hadoop Hangover":
"Hadoop is hard to set up, use, and maintain. In and of itself, grid computing is difficult, and Hadoop doesn't make it any easier. Hadoop is still maturing from a developer's standpoint, let alone from the standpoint of a business user. Because only savvy Silicon Valley engineers can derive value from Hadoop, it's not going to make inroads into larger organizations without a lot of handholding and professional services." (Mike Driscoll, CEO of Metamarkets)
http://www.forbes.com/sites/danwoods/2012/07/27/how-to-avoid-a-hadoop-hangover/

Hadoop: Perception versus Reality
Current perception of Hadoop:
- Synonymous with Big Data and openness
- Capable of huge scale with ad-hoc infrastructure
Current reality of Hadoop:
- Many experimenting
- Much expertise in warehousing, little beyond that
- Data scientist bottleneck; performance not yet an issue
Current trajectory of Hadoop:
- Industry momentum: universities, government, private firms, etc.
- More users: beyond data scientists to domain scientists, analysts, etc.
- More complexity: multi-layered files, complex algorithms, etc.
Hadoop is widely perceived as high potential, not yet high value, but that's about to change.

What is Spark?
- A distributed data analytics engine, generalizing MapReduce
- Core engine, with streaming, SQL, machine learning, and graph processing modules
- Program in Python, Scala, and/or Java

Spark - Resilient Distributed Dataset (RDD)
A distributed collection of objects.
Benefits of RDDs:
- RDDs exist in memory
- Built via parallel transformations (map, filter, ...)
- RDDs are automatically rebuilt on failure
There are two ways to create RDDs (see the sketch below):
- Parallelizing an existing collection in your driver program
- Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat
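
A minimal sketch of those two creation paths; the SparkContext setup, file path, and transformations are illustrative assumptions, not from the slides:

import org.apache.spark.{SparkConf, SparkContext}

object RddCreation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-creation"))

    // 1) Parallelize an existing collection in the driver program
    val numbers = sc.parallelize(1 to 1000)
    val evens = numbers.filter(_ % 2 == 0)   // lazy transformation
    println(evens.count())                   // action triggers computation

    // 2) Reference a dataset in external storage (HDFS path is hypothetical)
    val lines = sc.textFile("hdfs:///data/events.log")
    println(lines.count())

    sc.stop()
  }
}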

Benefits of a Unified Platform
- No copying or ETL of data between systems
- Combine processing types in one program
- Code reuse
- One system to learn
- One system to maintain

Spark SQL
- Unified data access with SchemaRDDs (example sketch below)
- Tables are a representation of (schema + data) = SchemaRDD
- Hive compatibility
- Standard connectivity via ODBC and/or JDBC
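
A minimal sketch of working with SchemaRDDs, assuming the Spark 1.1-era SQL API; the case class, input path, and query are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Event(host: String, bytes: Long)

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spark-sql-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit conversion RDD[Event] -> SchemaRDD

    // Apply a schema to an ordinary RDD and register it as a table
    val events = sc.textFile("hdfs:///data/events.csv")
      .map(_.split(","))
      .map(f => Event(f(0), f(1).toLong))
    events.registerTempTable("events")

    // Query it with SQL; the result is again a SchemaRDD
    val totals = sqlContext.sql("SELECT host, SUM(bytes) AS total FROM events GROUP BY host")
    totals.collect().foreach(println)
  }
}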

Spark Streaming
- Spark Streaming expresses streams as a series of RDDs over time
- Combine streaming with batch and interactive queries
- Stateful and fault tolerant
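
A minimal streaming word count sketch showing the series-of-RDDs model; the socket source, host, port, and batch interval are illustrative assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-word-count")
    val ssc = new StreamingContext(conf, Seconds(5))   // one RDD (micro-batch) every 5 seconds

    // Each batch interval produces an RDD of the lines received in that window
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}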

Spark Streaming Inputs/Outputs

Spark Machine Learning
- Iterative computation
- Vectors, matrices = RDD[Vector]
Current MLlib 1.1 algorithms:
- Linear SVM and logistic regression
- Classification and regression tree
- k-means clustering
- Recommendation via alternating least squares
- Singular value decomposition
- Linear regression with L1 and L2 regularization
- Multinomial naive Bayes
- Basic statistics
- Feature transformations
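
As a minimal sketch of the iterative MLlib pattern, here is k-means clustering with the 1.1-era RDD API; the input path, cluster count, and iteration count are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-sketch"))

    // Each input line is a whitespace-separated feature vector, e.g. "0.1 2.3 4.5"
    val data = sc.textFile("hdfs:///data/features.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()   // iterative training re-reads the data, so keep it in memory

    // Train with 10 clusters and 20 iterations
    val model = KMeans.train(data, 10, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}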

Spark GraphX
- Unifies graphs with RDDs of edges and vertices
- View the same data as both graphs and collections
- Custom iterative graph algorithms via the Pregel API
Current GraphX algorithms:
- PageRank
- Connected components
- Label propagation
- SVD++
- Strongly connected components
- Triangle count
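
A minimal sketch of building a graph from vertex and edge RDDs and running PageRank; the vertex data, edge data, and tolerance value are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-sketch"))

    // Vertices: (id, hostname); edges: connections between hosts
    val vertices = sc.parallelize(Seq((1L, "hostA"), (2L, "hostB"), (3L, "hostC")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "connects"), Edge(2L, 3L, "connects"), Edge(3L, 1L, "connects")))

    val graph = Graph(vertices, edges)

    // PageRank until convergence within the given tolerance
    val ranks = graph.pageRank(0.0001).vertices
    ranks.collect().foreach { case (id, rank) => println(s"$id: $rank") }

    sc.stop()
  }
}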

Applying Graphs to CyberAnalytics
- Use the graph as a pre-merged perspective of all the available data sets
- Graphs enable discovery: it's called a network, so represent that information in the more natural and appropriate format; graphs are optimized to show the relationships present in metadata
- Fail fast, fail cheap: choose a graph engine that supports rapid hypothesis testing; returning answers before the analyst forgets why he asked them enables the investigative discovery flow
- Using this framework, analysts can more easily and more quickly find unusual things; this matters significantly when there is a constant threat of new unusual things
- When all focus is no longer on dealing with the known, there is bandwidth for discovery
- When all data can be analyzed in a holistic manner, new patterns and relationships can be seen

Using Graph Analysis to Identify Patterns
Example mature cyber-security questions:
- Who hacked us?
- What did they touch in our network? Where else did they go?
- What unknown botnets are we hosting?
- What are the vulnerabilities in our network configuration?
- Who are the key influencers in the company / on the network?
- What's weird that's happening on the network?
Proven graph algorithms help answer these questions (see the sketch below):
- Subgraph identification
- Alias identification
- Shortest-path identification
- Common-node identification
- Clustering / community identification
- Graph-based cyber-security discovery environment
Analytic tradecraft and algorithms mature together: general questions require Swiss Army knives; specific, well-understood questions use X-Acto knives.
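
As an illustrative sketch of one of these techniques, clustering/community identification can be approximated with connected components in GraphX; the host-to-host flow records below are a hypothetical input:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object CommunitySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("community-sketch"))

    // Hypothetical flow records: (source host id, destination host id)
    val flows = sc.parallelize(Seq((1L, 2L), (2L, 3L), (10L, 11L)))
    val edges = flows.map { case (src, dst) => Edge(src, dst, 1) }
    val graph = Graph.fromEdges(edges, 0)   // default vertex attribute

    // Connected components label every vertex with the smallest vertex id in its
    // component, which groups hosts into communities of communication
    val components = graph.connectedComponents().vertices
    components.map { case (host, component) => (component, host) }
      .groupByKey()
      .collect()
      .foreach { case (component, hosts) =>
        println(s"community $component: ${hosts.mkString(", ")}")
      }

    sc.stop()
  }
}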

Spark System Requirements
Storage systems: Because most Spark jobs will likely have to read input data from an external storage system, it is important to place Spark as close to that system as possible. If at all possible, run Spark on the same nodes as HDFS. The simplest way is to set up a Spark standalone-mode cluster on the same nodes, and configure Spark and Hadoop's memory and CPU usage to avoid interference.
Local disks: While Spark can perform a lot of its computation in memory, it still uses local disks to store data that doesn't fit in RAM, as well as to preserve intermediate output between stages. We recommend having 4-8 disks per node, configured without RAID.
https://spark.apache.org/docs/latest/hardware-provisioning.html

Spark System Requirements (continued)
Memory: Spark runs well with anywhere from 8 GB to hundreds of gigabytes of memory per machine. In all cases, we recommend allocating at most 75% of the memory for Spark; leave the rest for the operating system and buffer cache.
Network: When the data is in memory, a lot of Spark applications are network-bound. Using a 10 Gigabit or higher network is the best way to make these applications faster. This is especially true for distributed reduce applications such as group-bys, reduce-bys, and SQL joins.
CPU cores: Spark scales well to tens of CPU cores per machine because it performs minimal sharing between threads. You should likely provision at least 8-16 cores per machine.
https://spark.apache.org/docs/latest/hardware-provisioning.html
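
As a rough illustration of how these guidelines might translate into application settings, the node size, property values, and deployment mode below are illustrative assumptions, not recommendations from the slides:

import org.apache.spark.{SparkConf, SparkContext}

object ProvisioningSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical sizing for a 64 GB, 16-core node: leave roughly 25% of memory
    // for the operating system and buffer cache, and use all cores for Spark.
    val conf = new SparkConf()
      .setAppName("provisioning-sketch")
      .set("spark.executor.memory", "48g")   // about 75% of 64 GB
      .set("spark.executor.cores", "16")
    val sc = new SparkContext(conf)
    // ... application code ...
    sc.stop()
  }
}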

Benefits of HDFS
- Scale-out architecture: add servers to increase capacity
- High availability: serve mission-critical workflows and applications
- Fault tolerance: automatically and seamlessly recover from failures
- Flexible access: multiple and open frameworks for serialization and file system mounts
- Load balancing: place data intelligently for maximum efficiency and utilization
- Configurable replication: multiple copies of each file provide data protection and computational performance

HDFS Sequence Files
A sequence file is a data structure for binary key-value pairs. It can be used as a common format to transfer data between MapReduce jobs. Another important advantage of a sequence file is that it can be used as an archive to pack smaller files.
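
As a brief sketch of how sequence files and Spark fit together, a SequenceFile of text keys and integer values can be read directly into an RDD; the path and key/value types are illustrative assumptions:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object SequenceFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sequencefile-sketch"))

    // Read a SequenceFile of (Text, IntWritable) pairs and convert to plain Scala types
    val pairs = sc.sequenceFile("hdfs:///data/counts.seq", classOf[Text], classOf[IntWritable])
      .map { case (k, v) => (k.toString, v.get()) }

    pairs.take(10).foreach(println)
    sc.stop()
  }
}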

Hybrid Spark Architecture
Apache Spark should be complementary to your existing architecture:
- Enhance existing system capabilities
- Assume some of the analytic workload
- Handle archive storage

Spark Performance
- In-memory performance
- Active open source community
- Order-of-magnitude graph performance

Performance: Spark Wins Daytona GraySort 100TB Benchmark
Using Spark, they sorted 100TB of data on 206 EC2 i2.8xlarge machines in 23 minutes. The previous world record was 72 minutes, set by a Hadoop MapReduce cluster of 2100 nodes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark's in-memory cache. Outperforming large Hadoop MapReduce clusters on sorting not only validates the vision and work done by the Spark community, but also demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes.
https://spark.apache.org/news/spark-wins-daytona-gray-sort-100tb-benchmark.html

Why you should love Scala (if you don't already)

Word Count Example Spark Scala

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Word Count Example MapReduce

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

Word Count Example MapReduce (continued)

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }

Word Count Example MapReduce (continued)

    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);

Word Count Example MapReduce (continued)

    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);

Word Count Example MapReduce (continued)

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

5 lines of Spark Scala code vs. 57 lines of MapReduce code

Useful Resources
- Apache Spark: https://spark.apache.org/
- Spark Summit 2014: http://spark-summit.org/2014
- Apache Spark Reference Card: http://refcardz.dzone.com/refcardz/apache-spark
- Apache Spark Meetups: http://spark.meetup.com/

Thank You! Questions?