High-Speed In-Memory Analytics over Hadoop and Hive Data
- Bernice Johnston
- 8 years ago
1 High-Speed In-Memory Analytics over Hadoop and Hive Data Big Data 2015
2 Apache Spark
- Not a modified version of Hadoop
- Separate, fast, MapReduce-like engine
- In-memory data storage for very fast iterative queries
- General execution graphs and powerful optimizations
- Up to 40x faster than Hadoop
- Compatible with Hadoop's storage APIs
- Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
3 Apache Spark [Figure: the input goes through one-time processing into distributed memory, which then serves iterations (iter. 1, iter. 2, ...) and queries (query 1, query 2, query 3, ...); 10-100x faster than network and disk]
4 Users
5 Shark
- Port of Apache Hive to run on Spark
- Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
- Similar speedups of up to 40x
6 Shark [Architecture diagram: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution, Cache Mgr.); Metastore; Spark; HDFS]
7 Software Stack [Diagram: Shark (Hive on Spark), Bagel (Pregel on Spark), Streaming Spark, ... on top of Spark; Spark runs in local mode, on EC2, on Apache Mesos, or on YARN]
8 Spark Configuration Download a binary release of Apache Spark: spark bin-hadoop2.6.tgz
9 Spark Configuration In the conf directory under the Spark home directory, set up the spark-env.sh file if needed
10 Shark Configuration Shark has been subsumed by Spark SQL, a new module in Apache Spark:
11 Spark Running
Running Spark Shell [scala]: $:~spark-*/bin/spark-shell
Running Spark Shell [python]: $:~spark-*/bin/pyspark
Spark Shell - Scala
Welcome to [Spark ASCII-art banner with version]
Using Scala version (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_05)
Type in expressions to have them evaluated.
scala>
12 Spark Self-contained applications Java Spark API
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logfile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Load the file as an RDD of lines and keep it in memory for both counts
    JavaRDD<String> logdata = sc.textFile(logfile).cache();

    // Count the lines that contain "a"
    long numas = logdata.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    // Count the lines that contain "b"
    long numbs = logdata.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numas + ", lines with b: " + numbs);
  }
}
13 Spark Self-contained applications Java Spark API: configuration of the Spark application
String logfile = "YOUR_SPARK_HOME/README.md";
SparkConf conf = new SparkConf().setAppName("Simple Application");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> logdata = sc.textFile(logfile).cache();
14 Spark Self-contained applications Java Spark API: Spark actions
long numas = logdata.filter(new Function<String, Boolean>() {
  public Boolean call(String s) { return s.contains("a"); }
}).count();
long numbs = logdata.filter(new Function<String, Boolean>() {
  public Boolean call(String s) { return s.contains("b"); }
}).count();
15 Spark Self-contained applications SimpleApp.java
create logdata: an object like [line1, line2, line3, ...]
  sopra la panca la capra campa, sotto la panca la capra crepa!
Output:
  Lines with a: 1, lines with b: 0
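To make the result on this slide easy to check without a Spark installation, here is a minimal plain-Java sketch of the same two filter-and-count steps. It does not use Spark at all: a `List<String>` stands in for the cached RDD, and the class name `SimpleAppSketch` and the helper `countLinesContaining` are hypothetical names introduced only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the two filter(...).count() actions in SimpleApp.
// No Spark is involved; a List<String> plays the role of the RDD of lines.
class SimpleAppSketch {

    // Counts the lines that contain the given substring, mirroring
    // logdata.filter(s -> s.contains(letter)).count() in the Spark version.
    static long countLinesContaining(List<String> lines, String letter) {
        return lines.stream().filter(line -> line.contains(letter)).count();
    }

    public static void main(String[] args) {
        // The single-line sample input shown on the slide.
        List<String> logdata = Arrays.asList(
            "sopra la panca la capra campa, sotto la panca la capra crepa!");

        long numas = countLinesContaining(logdata, "a");
        long numbs = countLinesContaining(logdata, "b");

        // Prints: Lines with a: 1, lines with b: 0
        System.out.println("Lines with a: " + numas + ", lines with b: " + numbs);
    }
}
```

The counts are per line, not per occurrence, which is why a one-line input yields at most 1.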
16 Spark Self-contained applications pom.xml Maven Project
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>sparkproject</groupId>
  <artifactId>sparkproject</artifactId>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1</version>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.10</artifactId>
      <version>1.3.1</version>
    </dependency>
  </dependencies>
</project>
17 Spark Running - standalone
Running Java Spark applications:
  $:~spark-*/bin/spark-submit --class "SimpleApp" --master local[4] SparkProject-1.0.jar
Output [terminal]:
  Lines with a: 46, Lines with b: 23
18 Spark Running - standalone
Running Java Spark applications and exporting the output to a text file:
  $:~spark-*/bin/spark-submit --class "SimpleApp" --master local[4] SparkProject-1.0.jar > output.txt
output.txt:
  Lines with a: 46, Lines with b: 23
19 Exercises
- Word counting
- Filtering log files
- Computing the page rank of web sites
- Computing the value of Pi
- Computing the transitive closure of a graph
- Querying structured data via Shark (Spark SQL)
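As a starting point for the Pi exercise, one common approach is Monte Carlo sampling: draw random points in the square [-1, 1] x [-1, 1] and take four times the fraction that lands inside the unit circle. The sketch below is plain Java with no Spark involved, and the names `PiSketch` and `estimatePi` are hypothetical; in a Spark version, the loop would presumably become a parallelized collection that is mapped and reduced.

```java
import java.util.Random;

// Sequential Monte Carlo estimate of Pi (no Spark; illustration only).
class PiSketch {

    // Samples points uniformly in the square [-1, 1] x [-1, 1] and returns
    // 4 * (fraction of points that fall inside the unit circle).
    static double estimatePi(int samples, long seed) {
        Random random = new Random(seed);
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = random.nextDouble() * 2 - 1;
            double y = random.nextDouble() * 2 - 1;
            if (x * x + y * y <= 1) {
                inside++;
            }
        }
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        // The seed is fixed only so that repeated runs give the same estimate.
        System.out.println("Pi is roughly " + estimatePi(1_000_000, 42L));
    }
}
```

The estimate improves slowly with the number of samples, which is what makes the computation a reasonable candidate for spreading across a cluster.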
20 Spark Self-contained applications JavaWordCount.java
create words: an object like [word1, word2, word3, ...]
  sopra la panca la capra campa, sotto la panca la capra crepa
create ones: an object like [(word1,1), (word2,1), (word3,1), ...]
  (sopra,1) (la,1) (panca,1) (la,1) (capra,1) (campa,,1) (sotto,1) (la,1) (panca,1) (la,1) (capra,1) (crepa,1)
Result:
  Counting: 1 1
  Counting: 2 1
  Counting: 1 1
  Counting: 3 1
  Counting: 1 1
  panca: 2
  la: 4
  campa,: 1
  sotto: 1
  crepa: 1
  sopra: 1
  capra: 2
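The words, ones, and counts steps above can be sketched in plain Java as follows. This is not the JavaWordCount.java from the slides and does not use Spark: the stream pipeline only imitates the shape of splitting into words, pairing each word with 1, and summing the 1s per word. The names `WordCountSketch` and `countWords` are hypothetical. The "Counting: x y" lines on the slide look like debug output from the summing step and are not reproduced here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java imitation of the word-count pipeline (no Spark involved).
class WordCountSketch {

    // Splits the line on single spaces ("words"), maps every word to the
    // value 1 ("ones"), and sums the values of equal words ("counts").
    static Map<String, Integer> countWords(String line) {
        return Arrays.stream(line.split(" "))
                .collect(Collectors.toMap(
                        word -> word,        // key: the word itself
                        word -> 1,           // value: the pair (word, 1)
                        Integer::sum,        // merge: add counts of equal words
                        LinkedHashMap::new)); // keep first-seen order
    }

    public static void main(String[] args) {
        String line = "sopra la panca la capra campa, sotto la panca la capra crepa";
        countWords(line).forEach((word, count) ->
                System.out.println(word + ": " + count));
    }
}
```

Because the line is split only on spaces, "campa," keeps its trailing comma and is counted as its own word, matching the "campa,: 1" entry in the result above. The order of the printed entries may differ from the slide.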
21 Spark Self-contained applications JavaSimpleApp2: for each line, count how many words contain the letter a and how many words contain the letter b
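The per-line counting described here could look like the following plain-Java sketch. It is not the JavaSimpleApp2 from the course and uses no Spark; `SimpleApp2Sketch` and `wordsContaining` are hypothetical names, and the second input line is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch: for each line, count the words containing a given letter.
class SimpleApp2Sketch {

    // Splits the line on single spaces and counts the words that contain
    // the given letter at least once.
    static long wordsContaining(String line, String letter) {
        return Arrays.stream(line.split(" "))
                .filter(word -> word.contains(letter))
                .count();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "sopra la panca la capra campa, sotto la panca la capra crepa!",
            "a hypothetical second line about big data"); // made-up sample line

        for (String line : lines) {
            System.out.println("words with a: " + wordsContaining(line, "a")
                    + ", words with b: " + wordsContaining(line, "b"));
        }
    }
}
```

In a Spark version, the loop over lines would presumably become a map over the RDD of lines, with `wordsContaining` applied inside the mapping function.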
22 Spark Self-contained applications JavaPageRank.java
Basic idea: give pages ranks (scores) based on links to them
- Links from many pages → high rank
- Link from a high-rank page → high rank
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
23 Spark Self-contained applications JavaPageRank.java
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Figure: four-page graph, ranks 1.0, 1.0, 1.0, 1.0]
24 Spark Self-contained applications JavaPageRank.java
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Figure: ranks 1.0, 1.0, 1.0, 1.0; contributions along the links: 1, 1, 0.5, 0.5, 0.5, 0.5]
25 Spark Self-contained applications JavaPageRank.java
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Figure: ranks 1.85, 0.58, 1.0, 0.58]
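Step 3 can be checked by hand against the ranks on this slide. The symbol c_p, the total contribution a page receives, is introduced only for this note. Which page receives which total is not stated in the transcript; the totals 2, 1, and 0.5 below are inferred from the displayed ranks and from the contributions 1, 1, 0.5, 0.5, 0.5, 0.5 on the previous slide.

```latex
\begin{align*}
r_p &\leftarrow 0.15 + 0.85\, c_p,
  \qquad c_p = \sum_{q \to p} \frac{r_q}{|\mathrm{neighbors}_q|} \\
c_p = 2   &: \quad 0.15 + 0.85 \times 2   = 1.85 \\
c_p = 1   &: \quad 0.15 + 0.85 \times 1   = 1.00 \\
c_p = 0.5 &: \quad 0.15 + 0.85 \times 0.5 = 0.575 \approx 0.58
\end{align*}
```

The received totals add up to 2 + 1 + 0.5 + 0.5 = 4, the same as the sum of the four starting ranks, since every page passes on its whole rank.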
26 Spark Self-contained applications JavaPageRank.java
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Figure: ranks 1.85, 0.58, 1.0, 0.58; contributions along the links: 1.85, 0.58, 0.29, 0.29, 0.5, 0.5]
27 Spark Self-contained applications JavaPageRank.java
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Figure: ranks 1.31, 0.39, 1.72, 0.58, ...]
28 Spark Self-contained applications JavaPageRank.java
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
Final state: 1.44, 0.46, 1.37, 0.73
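The three steps can be tried out with the small plain-Java sketch below. It is not the JavaPageRank.java from the slides and uses no Spark; arrays stand in for the RDDs of links and ranks. The four-page link structure in `LINKS` is an assumption: the original figure is not part of the transcript, and this graph was chosen because it is consistent with the rank values shown on the slides. `PageRankSketch` and `pageRank` are hypothetical names.

```java
import java.util.Arrays;

// Sequential PageRank following the three steps on the slides (no Spark).
class PageRankSketch {

    // LINKS[p] lists the pages that page p links to.
    // ASSUMED graph (pages A=0, B=1, C=2, D=3), not taken from the slides:
    // A -> B;  B -> A, D;  C -> A;  D -> A, C.
    static final int[][] LINKS = { {1}, {0, 3}, {0}, {0, 2} };

    static double[] pageRank(int[][] links, int iterations) {
        int n = links.length;

        // Step 1: start each page at a rank of 1.
        double[] ranks = new double[n];
        Arrays.fill(ranks, 1.0);

        for (int it = 0; it < iterations; it++) {
            // Step 2: page p contributes rank_p / |neighbors_p| to each neighbor.
            double[] contribs = new double[n];
            for (int p = 0; p < n; p++) {
                for (int neighbor : links[p]) {
                    contribs[neighbor] += ranks[p] / links[p].length;
                }
            }
            // Step 3: set each page's rank to 0.15 + 0.85 * contribs.
            for (int p = 0; p < n; p++) {
                ranks[p] = 0.15 + 0.85 * contribs[p];
            }
        }
        return ranks;
    }

    public static void main(String[] args) {
        System.out.println("After 1 iteration:   " + Arrays.toString(pageRank(LINKS, 1)));
        System.out.println("After 50 iterations: " + Arrays.toString(pageRank(LINKS, 50)));
    }
}
```

With the assumed graph, one iteration gives ranks of about 1.85, 1.0, 0.58, 0.58 for A, B, C, D, and repeated iterations settle near 1.44, 1.37, 0.46, 0.73. These are the same values as on the slides, listed in page order rather than in the figure's layout order.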
29 Resources
More informationMap Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
More informationXiaoming Gao Hui Li Thilina Gunarathne
Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal
More informationSQL on NoSQL (and all of the data) With Apache Drill
SQL on NoSQL (and all of the data) With Apache Drill Richard Shaw Solutions Architect @aggress Who What Where NoSQL DB Very Nice People Open Source Distributed Storage & Compute Platform (up to 1000s of
More informationHow to Run Spark Application
How to Run Spark Application Junghoon Kang Contents 1 Intro 2 2 How to Install Spark on a Local Machine? 2 2.1 On Ubuntu 14.04.................................... 2 3 How to Run Spark Application on a
More informationHadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
More informationIntroduction to NoSQL Databases. Tore Risch Information Technology Uppsala University 2013-03-05
Introduction to NoSQL Databases Tore Risch Information Technology Uppsala University 2013-03-05 UDBL Tore Risch Uppsala University, Sweden Evolution of DBMS technology Distributed databases SQL 1960 1970
More informationCustomer Case Study. Sharethrough
Customer Case Study Customer Case Study Benefits Faster prototyping of new applications Easier debugging of complex pipelines Improved overall engineering team productivity Summary offers a robust advertising
More informationPro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah
Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big
More informationDeveloping Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
More information