Google Cloud Dataflow
1 Google Cloud Dataflow Cosmin Arad, Senior Software Engineer August 7, 2015
2 Agenda 1 Dataflow Overview 2 Dataflow SDK Concepts (Programming Model) 3 Cloud Dataflow Service 4 Demo: Counting Words! 5 Questions and Discussion
3 Cloud Dataflow
History of Big Data at Google: GFS, MapReduce, BigTable, Dremel, Pregel, Flume, Spanner, MillWheel, Colossus
4 Big Data on Google Cloud Platform
Capture: Pub/Sub, Logs, App Engine, BigQuery streaming
Store: Cloud Storage (objects), Cloud SQL (MySQL), Datastore (NoSQL), BigTable (structured)
Process: Dataflow (stream and batch), Hadoop, Spark (on GCE)
Analyze: BigQuery, Hadoop, Spark (on GCE), larger Hadoop ecosystem
5 What is Cloud Dataflow? Cloud Dataflow is a collection of SDKs for building parallelized data processing pipelines Cloud Dataflow is a managed service for executing parallelized data processing pipelines
6 Where might you use Cloud Dataflow? ETL Analysis Orchestration
7 Where might you use Cloud Dataflow? Movement Filtering Enrichment Shaping Reduction Batch computation Continuous computation Composition External orchestration Simulation
8 Dataflow SDK Concepts (Programming Model)
10 Dataflow SDK(s)
Easily construct parallelized data processing pipelines using an intuitive set of programming abstractions.
Google supported and open sourced. Design principles: do what the user expects; no knobs whenever possible; build for extensibility; unified batch & streaming semantics.
Java 7: github.com/googlecloudplatform/dataflowjavasdk
Python 2: in progress
Community sourced: github.com/darkjh/scalaflow, github.com/jhlch/scala-dataflow-dsl
11 Dataflow Java SDK Release Process [Figure: release pipeline with weekly and monthly cadences]
12 Pipeline
A directed graph of data processing transformations.
Optimized and executed as a unit.
May include multiple inputs and multiple outputs.
May encompass many logical MapReduce or MillWheel operations.
PCollections conceptually flow through the pipeline.
13 Runners
Specify how a pipeline should run.
Direct Runner: local, in-memory execution; great for development and unit tests.
Cloud Dataflow Service: batch mode (GCE instances poll for work items to execute); streaming mode (GCE instances are set up in a semi-permanent topology).
Community sourced: Spark from github.com/cloudera/spark-dataflow, Flink from github.com/dataartisans/flink-dataflow
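A hedged sketch of selecting a runner with the 1.x Java SDK (the runner classes are from that SDK; the project id and staging bucket below are placeholders):

    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    // Local development and unit tests: execute in-memory.
    options.setRunner(DirectPipelineRunner.class);

    // Managed execution: submit the job to the Cloud Dataflow service.
    DataflowPipelineOptions cloud = options.as(DataflowPipelineOptions.class);
    cloud.setRunner(DataflowPipelineRunner.class);
    cloud.setProject("my-project");        // placeholder project id
    cloud.setStagingLocation("gs://...");  // staging bucket elided

    Pipeline p = Pipeline.create(options);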
14 Example: #HashTag Autocompletion
15 Pipeline: Tweets -> Read -> ExtractTags -> Count -> ExpandPrefixes -> Top(3) -> Write -> Predictions
Read: {Go Hawks #Seahawks!, #Seattle works museum pass. Free! Go #PatriotsNation! Having fun at #seaside, ...}
ExtractTags: {seahawks, seattle, patriotsnation, lovemypats, ...}
Count: {seahawks->5M, seattle->2M, patriots->9M, ...}
ExpandPrefixes: {d->(deflategate, 10M), d->(denver, 2M), ..., sea->(seahawks, 5M), sea->(seaside, 2M), ...}
Top(3) and Write: {d->[deflategate, desafiodatransa, djokovic], ..., de->[deflategate, desafiodatransa, dead50], ...}
16 The same pipeline as code:
Pipeline p = Pipeline.create();
p.begin()
    .apply(TextIO.Read.from("gs://..."))     // Read: Tweets
    .apply(ParDo.of(new ExtractTags()))      // ExtractTags
    .apply(Count.perElement())               // Count
    .apply(ParDo.of(new ExpandPrefixes()))   // ExpandPrefixes
    .apply(Top.largestPerKey(3))             // Top(3)
    .apply(TextIO.Write.to("gs://..."));     // Write: Predictions
p.run();
17 Dataflow Basics
Pipeline: directed graph of steps operating on data.
Pipeline p = Pipeline.create();
p.run();
18 Dataflow Basics
Pipeline: directed graph of steps operating on data.
Data: PCollection, an immutable collection of same-typed elements that can be encoded; also PCollectionTuple, PCollectionList.
Pipeline p = Pipeline.create();
p.begin()
    .apply(TextIO.Read.from("gs://..."))
    .apply(TextIO.Write.to("gs://..."));
p.run();
19 Dataflow Basics
Pipeline: directed graph of steps operating on data.
Data: PCollection, an immutable collection of same-typed elements that can be encoded; also PCollectionTuple, PCollectionList.
Transformation: a step that operates on data. Core transforms: ParDo, GroupByKey, Combine, Flatten. Composite and custom transforms.
Pipeline p = Pipeline.create();
p.begin()
    .apply(TextIO.Read.from("gs://..."))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.perElement())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(TextIO.Write.to("gs://..."));
p.run();
20 Dataflow Basics
class ExpandPrefixes {
  ...
  public void processElement(ProcessContext c) {
    String word = c.element().getKey();
    for (int i = 1; i <= word.length(); i++) {
      String prefix = word.substring(0, i);
      c.output(KV.of(prefix, c.element()));
    }
  }
}
Pipeline p = Pipeline.create();
p.begin()
    .apply(TextIO.Read.from("gs://..."))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.perElement())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(TextIO.Write.to("gs://..."));
p.run();
21 PCollections
A collection of data of type T in a pipeline.
May be either bounded or unbounded in size.
Created by using a PTransform to:
- build from a java.util.Collection
- read from a backing data store
- transform an existing PCollection
Often contain key-value pairs using KV<K, V>.
{Seahawks, NFC, Champions, Seattle, ...} {..., NFC Champions #GreenBay, Green Bay #superbowl!, ... #GoHawks, ...}
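As a hedged illustration of the "build from a java.util.Collection" case, using the SDK's Create transform (the values reuse this slide's sample elements; the explicit coder is one way to satisfy coder inference for KV elements):

    // Build a bounded PCollection from in-memory values:
    PCollection<String> teams =
        p.apply(Create.of("Seahawks", "NFC", "Champions", "Seattle"));

    // Key-value elements use KV<K, V>; give Create an explicit coder:
    PCollection<KV<String, String>> keyed =
        p.apply(Create.of(KV.of("S", "Seahawks"), KV.of("N", "NFC"))
                      .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));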
22 Inputs & Outputs
Read from standard Google Cloud Platform data sources: GCS, Pub/Sub, BigQuery, Datastore, ...
Write your own custom source by teaching Dataflow how to read it in parallel.
Write to standard Google Cloud Platform data sinks: GCS, BigQuery, Pub/Sub, Datastore, ...
Can use a combination of text, JSON, XML, Avro formatted data.
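A hedged sketch of two of these connectors in the 1.x Java SDK (topic and table names are placeholders; rows is assumed to be a PCollection<TableRow> and schema a TableSchema built elsewhere):

    // Read a stream of messages from Cloud Pub/Sub:
    PCollection<String> events =
        p.apply(PubsubIO.Read.topic("projects/my-project/topics/my-topic"));

    // Write rows to a BigQuery table:
    rows.apply(BigQueryIO.Write.to("my-project:my_dataset.my_table")
                               .withSchema(schema));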
23 Coders
A Coder<T> explains how an element of type T can be written to disk or communicated between machines.
Every PCollection<T> needs a valid coder in case the service decides to communicate those values between machines.
Encoded values are used to compare keys, so encodings need to be deterministic.
Avro coder inference can infer a coder for many basic Java objects.
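A hedged sketch of working with coders explicitly (tags stands for a PCollection<String> from earlier; MyRecord is a hypothetical user type):

    // Pin a coder explicitly when inference is not enough:
    tags.setCoder(StringUtf8Coder.of());

    // Or let Avro reflection derive a coder for a user type:
    @DefaultCoder(AvroCoder.class)
    static class MyRecord {
      String tag;
      long count;
    }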
24 ParDo ("Parallel Do")
Processes each element of a PCollection independently using a user-provided DoFn.
{Seahawks, NFC, Champions, Seattle, ...} -> LowerCase -> {seahawks, nfc, champions, seattle, ...}
25 ParDo ("Parallel Do")
Processes each element of a PCollection independently using a user-provided DoFn.
PCollection<String> tweets = ...;
tweets.apply(ParDo.of(
    new DoFn<String, String>() {
      public void processElement(ProcessContext c) {
        c.output(c.element().toLowerCase());
      }
    }));
{Seahawks, NFC, Champions, Seattle, ...} -> LowerCase -> {seahawks, nfc, champions, seattle, ...}
26 ParDo ("Parallel Do")
Processes each element of a PCollection independently using a user-provided DoFn.
{Seahawks, NFC, Champions, Seattle, ...} -> FilterOutSWords -> {NFC, Champions, ...}
27 ParDo ("Parallel Do")
Processes each element of a PCollection independently using a user-provided DoFn.
{Seahawks, NFC, Champions, Seattle, ...} -> ExpandPrefixes -> {s, se, sea, seah, seaha, seahaw, seahawk, seahawks, n, nf, nfc, c, ch, cha, cham, champ, champi, champio, champion, champions, s, se, sea, seat, seatt, seattl, seattle, ...}
28 ParDo ("Parallel Do")
Processes each element of a PCollection independently using a user-provided DoFn.
{Seahawks, NFC, Champions, Seattle, ...} -> KeyByFirstLetter -> {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
29 ParDo ("Parallel Do")
Elements are processed in arbitrary bundles, e.g. shards: startBundle(), processElement()*, finishBundle().
Supports arbitrary amounts of parallelization.
Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo -> GBK -> ParDo.
{Seahawks, NFC, Champions, Seattle, ...} -> KeyByFirstLetter -> {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
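A hedged sketch of those bundle lifecycle hooks on a DoFn (the KeyByFirstLetter name mirrors the slide; the connection comments are illustrative):

    static class KeyByFirstLetter extends DoFn<String, KV<String, String>> {
      @Override
      public void startBundle(Context c) {
        // Runs once per bundle, e.g. open a client connection.
      }

      @Override
      public void processElement(ProcessContext c) {
        String word = c.element();
        c.output(KV.of(word.substring(0, 1), word));
      }

      @Override
      public void finishBundle(Context c) {
        // Runs once per bundle, e.g. flush buffered output.
      }
    }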
30 GroupByKey
Takes a PCollection of key-value pairs and gathers up all values with the same key.
Corresponds to the shuffle phase in Hadoop.
How do you do a GroupByKey on an unbounded PCollection?
{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...} -> GroupByKey -> {KV<S, {Seahawks, Seattle, ...}>, KV<N, {NFC, ...}>, KV<C, {Champions, ...}>}
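A hedged sketch of the corresponding SDK call (keyed stands for the keyed PCollection above):

    // Gather all values that share a key into one Iterable per key:
    PCollection<KV<String, Iterable<String>>> grouped =
        keyed.apply(GroupByKey.<String, String>create());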
31 Windows [Figure: elements bucketed into Nighttime / Mid-Day / Nighttime windows]
Logically divide up or group the elements of a PCollection into finite windows.
Fixed windows: hourly, daily, ...
Sliding windows.
Sessions.
Required for GroupByKey-based transforms on an unbounded PCollection, but can also be used for bounded PCollections.
Window.into() can be called at any point in the pipeline and will be applied when needed.
Can be tied to arrival time or custom event time.
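A hedged sketch of the three window shapes named above, using the 1.x SDK's windowing API (items stands for any PCollection<String>; Duration is Joda-Time's; the durations are illustrative):

    // Fixed (tumbling) hourly windows:
    items.apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1))));

    // Sliding windows one hour wide, starting every five minutes:
    items.apply(Window.<String>into(
        SlidingWindows.of(Duration.standardHours(1))
                      .every(Duration.standardMinutes(5))));

    // Session windows closing after a 10-minute gap in activity:
    items.apply(Window.<String>into(
        Sessions.withGapDuration(Duration.standardMinutes(10))));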
32 [Figure: event time vs. processing time, showing the skew between them and the watermark tracking event time]
33 Triggers Determine when to emit elements into an aggregated Window. Provide flexibility for dealing with time skew and data lag. Example use: Deal with late-arriving data. (Someone was in the woods playing Candy Crush offline.) Example use: Get early results, before all the data in a given window has arrived. (Want to know # users per hour, with updates every 5 minutes.)
34 Late & Speculative Results
PCollection<KV<String, Long>> sums = Pipeline.begin()
    .read("userrequests")
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(
            AfterEach.inOrder(
                Repeatedly.forever(
                    AfterProcessingTime.pastFirstElementInPane()
                        .alignedTo(Duration.standardMinutes(1)))
                    .orFinally(AfterWatermark.pastEndOfWindow()),
                Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                .orFinally(AfterWatermark.pastEndOfWindow()
                    .plusDelayOf(Duration.standardDays(7))))
        .accumulatingFiredPanes())
    .apply(new Sum());
35 Composite PTransforms
Define new PTransforms by building up subgraphs of existing transforms.
Some utilities are included in the SDK: Count, RemoveDuplicates, Join, Min, Max, Sum, ...
You can define your own, for modularity, code reuse, and a better monitoring experience.
Example: Count = Pair With Ones -> GroupByKey -> Sum Values
36 Composite PTransforms
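The code on slide 36 is not in the transcription; a hedged sketch of a composite transform matching the Count diagram above (in the 1.x SDK a composite overrides apply):

    static class CountWords
        extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
      @Override
      public PCollection<KV<String, Long>> apply(PCollection<String> words) {
        // Pair With Ones -> GroupByKey -> Sum Values is exactly what the
        // SDK's Count.perElement() composite packages up.
        return words.apply(Count.<String>perElement());
      }
    }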
37 Cloud Dataflow Service
38 Life of a Pipeline [Figure: user code & SDK is deployed and scheduled by the managed service's Job Manager; a Work Manager runs it on Google Cloud Platform; progress & logs feed the Monitoring UI]
39 Cloud Dataflow Service Fundamentals
Pipeline optimization: modular code, efficient execution.
Smart workers: lifecycle management, auto-scaling, and dynamic work rebalancing.
Easy monitoring: Dataflow UI, RESTful API and CLI, integration with Cloud Logging, Cloud Debugger, etc.
40 Graph Optimization
ParDo fusion: producer-consumer, sibling; intelligent fusion boundaries.
Combiner lifting, e.g. partial aggregations before reduction.
Flatten unzipping.
Reshard placement.
...
41 Optimizer: ParallelDo Fusion [Figure: consumer-producer fusion merges a ParallelDo C feeding a ParallelDo D into a single C+D; sibling fusion merges sibling ParallelDos C and D reading the same input into a single C+D]
42 Optimizer: Combiner Lifting [Figure: a CombineValues following a GroupByKey is split into a partial combine before the GBK and a final combine after it]
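Combiner lifting kicks in when a combining function follows a GroupByKey; a hedged sketch of the user-facing form (pairs stands for a PCollection<KV<String, Long>>):

    // The user writes a plain per-key sum; the optimizer may lift partial
    // sums ahead of the shuffle so each worker pre-aggregates locally:
    PCollection<KV<String, Long>> totals = pairs.apply(Sum.<String>longsPerKey());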
45 Worker Lifecycle Management: Deploy -> Schedule & Monitor -> Tear Down
46 Worker Pool Auto-Scaling [Figure: the worker pool grows and shrinks as load varies across 50 RPS, 800 RPS, 1,200 RPS, and 5,000 RPS]
47 Dynamic Work Rebalancing [Figure: rebalancing straggler work items shortens a job from 100 mins. to 65 mins.]
48 Monitoring
50 Fully-managed Service
Pipeline management: validation, pipeline execution graph optimizations, dynamic and adaptive sharding of computation stages, Monitoring UI.
Cloud resource management: spin up worker VMs, set up logging, manage exports, teardown.
51 Benefits of Dataflow on Google Cloud Platform
Ease of use: no performance tuning required; highly scalable and performant out of the box; novel techniques to lower end-to-end execution time; intuitive programming model + Java SDK; no dichotomy between batch and streaming processing; integrated with GCP (VMs, GCS, BigQuery, Datastore, ...).
Total cost of ownership: benefits from GCE's cost model.
52 Optimizing your time: no-ops, no-knobs, zero-config
Typical data processing: programming, monitoring, resource provisioning, performance tuning, handling growing scale, utilization improvements, deployment & configuration, reliability.
Data processing with Google Cloud Dataflow: programming, plus more time to dig into your data.
53 Demo: Counting Words!
54 Highlights from the live demo... WordCount Code: See the SDK Concepts in action Running on the Dataflow Service Monitoring job progress in the Dataflow Monitoring UI Looking at worker logs in Cloud Logging Using the CLI
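The WordCount code itself is not transcribed here; below is a hedged, minimal sketch of such a pipeline against the 1.x Java SDK (gs:// paths left elided as in the slides):

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.Count;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.values.KV;

    public class WordCount {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(TextIO.Read.from("gs://..."))                   // elided input path
         .apply(ParDo.of(new DoFn<String, String>() {           // split lines into words
           @Override
           public void processElement(ProcessContext c) {
             for (String word : c.element().split("[^a-zA-Z']+")) {
               if (!word.isEmpty()) {
                 c.output(word);
               }
             }
           }
         }))
         .apply(Count.<String>perElement())                     // word -> occurrence count
         .apply(ParDo.of(new DoFn<KV<String, Long>, String>() { // format results as text
           @Override
           public void processElement(ProcessContext c) {
             c.output(c.element().getKey() + ": " + c.element().getValue());
           }
         }))
         .apply(TextIO.Write.to("gs://..."));                   // elided output path
        p.run();
      }
    }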
55 Questions and Discussion
56 Getting Started cloud.google.com/dataflow/getting-started github.com/googlecloudplatform/dataflowjavasdk stackoverflow.com/questions/tagged/google-cloud-dataflow