A Brief Introduction of Existing Big Data Tools. Bingjing Zhang
|
|
|
- Warren Atkins
- 9 years ago
- Views:
Transcription
1 A Brief Introduction of Existing Big Data Tools Bingjing Zhang
2 Outline The world map of big data tools Layered architecture Big data tools for HPC and supercomputing MPI Big data tools on clouds MapReduce model Iterative MapReduce model DAG model Graph model Collective model Machine learning on big data Query on big data Stream data processing
3 The World of Big Data Tools DAG Model MapReduce Model Graph Model BSP/Collective Model Hadoop For Iterations/ Learning For Query Dryad/ DryadLINQ Drill Spark Pig/PigLatin Hive Tez Shark HaLoop Twister Harp Stratosphere Reef MRQL MPI Giraph Hama GraphLab GraphX For Streaming S4 Samza Storm Spark Streaming
4 Layered Architecture (Upper) NA Non Apache projects Green layers are Apache/Commercial Cloud (light) to HPC (darker) integration layers Orchestration & Workflow Oozie, ODE, Airavata and OODT (Tools) NA: Pegasus, Kepler, Swift, Taverna, Trident, ActiveBPEL, BioKepler, Galaxy Cross Cutting Capabilities Monitoring: Ambari, Ganglia, Nagios, Inca (NA) Security & Privacy Distributed Coordination: ZooKeeper, JGroups Message Protocols: Thrift, Protobuf (NA) Machine Learning Mahout, MLlib, MLbase CompLearn (NA) Hive (SQL on Hadoop) Hcatalog Interfaces Data Analytics Libraries: Statistics, Bioinformatics R, Bioconductor (NA) High Level (Integrated) Systems for Data Processing Pig (Procedural Language) Shark (SQL on Spark, NA) Parallel Horizontally Scalable Data Processing Hadoop Spark NA:Twister Tez Hama (Map (Iterative Storm Stratosphere (DAG) (BSP) Reduce) MR) Iterative MR Batch ABDS Inter-process Communication Hadoop, Spark Communications & Reductions Harp Collectives (NA) Pub/Sub Messaging Imagery ImageJ (NA) MRQL (SQL on Hadoop, Hama, Spark) S4 Yahoo Stream Linear Algebra Scalapack, PetSc (NA) Impala (NA) Cloudera (SQL on Hbase) Samza LinkedIn Netty (NA)/ZeroMQ (NA)/ActiveMQ/Qpid/Kafka Giraph ~Pregel Swazall (Log Files Google NA) Graph HPC Inter-process Communication MPI (NA) Pegasus on Hadoop (NA) The figure of layered architecture is from Prof. Geoffrey Fox
5 Layered Architecture (Lower) NA Non Apache projects Green layers are Apache/Commercial Cloud (light) to HPC (darker) integration layers Cross Cutting Capabilities Monitoring: Ambari, Ganglia, Nagios, Inca (NA) Security & Privacy Distributed Coordination: ZooKeeper, JGroups Message Protocols: Thrift, Protobuf (NA) In memory distributed databases/caches: GORA (general object from NoSQL), Memcached (NA), Redis(NA) (key value), Hazelcast (NA), Ehcache (NA); ORM Object Relational Mapping: Hibernate(NA), OpenJPA and JDBC Standard Extraction Tools SQL SciDB NoSQL: Column Solandra UIMA (NA) Tika MySQL Phoenix HBase Accumulo (Solr+ (Entities) (SQL on Arrays, Cassandra (Content) (NA) (Data on (Data on Cassandra) (Watson) R,Python (DHT) HBase) HDFS) HDFS) +Document NoSQL: Document MongoDB CouchDB (NA) NoSQL: General Graph Neo4J Java Gnu (NA) Yarcdata Commercial (NA) Lucene Solr ABDS Cluster Resource Management Mesos, Yarn, Helix, Llama(Cloudera) NoSQL: Key Value (all NA) Berkeley DB Azure Table Dynamo Amazon NoSQL: TripleStore RDF SparkQL Jena Sesame (NA) AllegroGraph Commercial Riak ~Dynamo HPC Cluster Resource Management Condor, Moab, Slurm, Torque(NA).. ABDS File Systems User Level HPC File Systems (NA) HDFS, Swift, Ceph FUSE(NA) Gluster, Lustre, GPFS, GFFS Object Stores POSIX Interface Distributed, Parallel, Federated Interoperability Layer Whirr / JClouds OCCI CDMI (NA) DevOps/Cloud Deployment Puppet/Chef/Boto/CloudMesh(NA) Voldemort ~Dynamo File Management RYA RDF on Accumulo irods(na) Data Transport BitTorrent, HTTP, FTP, SSH Globus Online (GridFTP) IaaS System Manager Open Source Commercial Clouds OpenStack, OpenNebula, Eucalyptus, CloudStack, vcloud, Amazon, Azure, Google Bare Metal The figure of layered architecture is from Prof. Geoffrey Fox
6 Big Data Tools for HPC and Supercomputing MPI(Message Passing Interface, 1992) Provide standardized function interfaces for communication between parallel processes. Collective communication operations Broadcast, Scatter, Gather, Reduce, Allgather, Allreduce, Reduce-scatter. Popular implementations MPICH (2001) OpenMPI (2004)
7 MapReduce Model Google MapReduce (2004) Jeffrey Dean et al. MapReduce: Simplified Data Processing on Large Clusters. OSDI Apache Hadoop (2005) Apache Hadoop 2.0 (2012) Vinod Kumar Vavilapalli et al. Apache Hadoop YARN: Yet Another Resource Negotiator, SOCC Separation between resource management and computation model.
8 Key Features of MapReduce Model Designed for clouds Large clusters of commodity machines Designed for big data Support from local disks based distributed file system (GFS / HDFS) Disk based intermediate data transfer in Shuffling MapReduce programming model Computation pattern: Map tasks and Reduce tasks Data abstraction: KeyValue pairs
9 Google MapReduce Mapper: split, read, emit intermediate KeyValue pairs Split 0 Split 1 Split 2 Worker (3) read (4) local write Worker Worker (1) fork (1) fork (1) fork (2) assign map User Program Master (2) assign reduce Reducer: repartition, emits final output Worker Worker (5) remote read (6) write Output File 0 Output File 1 Input files Map phase Intermediate files (on local disks) Reduce phase Output files
10 Iterative MapReduce Model Twister (2010) Jaliya Ekanayake et al. Twister: A Runtime for Iterative MapReduce. HPDC workshop Simple collectives: broadcasting and aggregation. HaLoop (2010) Yingyi Bu et al. HaLoop: Efficient Iterative Data Processing on Large clusters. VLDB Programming model RR ii+1 = RR 0 RR ii LL Loop-Aware Task Scheduling Caching and indexing for Loop-Invariant Data on local disk
11 Twister Programming Model Main program s process space Worker Nodes configuremaps( ) configurereduce( ) Local Disk while(condition){ Cacheable map/reduce tasks runmapreduce(...) May scatter/broadcast <Key,Value> Map() Iterations pairs directly May merge data in shuffling Reduce() Combine() Communications/data transfers via the operation pub-sub broker network & direct TCP updatecondition() } //end while close() Main program may contain many MapReduce invocations or iterative MapReduce invocations
12 DAG (Directed Acyclic Graph) Model Dryad and DryadLINQ (2007) Michael Isard et al. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks, EuroSys,
13 Model Composition Apache Spark (2010) Matei Zaharia et al. Spark: Cluster Computing with Working Sets,. HotCloud Matei Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI Resilient Distributed Dataset (RDD) RDD operations MapReduce-like parallel operations DAG of execution stages and pipelined transformations Simple collectives: broadcasting and aggregation
14 Graph Processing with BSP model Pregel (2010) Grzegorz Malewicz et al. Pregel: A System for Large-Scale Graph Processing. SIGMOD Apache Hama (2010) Apache Giraph (2012) Scaling Apache Giraph to a trillion edges
15 Pregel & Apache Giraph Computation Model Superstep as iteration Vertex state machine: Active and Inactive, vote to halt Message passing between vertices Combiners Aggregators Topology mutation Master/worker model Graph partition: hashing Fault tolerance: checkpointing and confined recovery Vote to halt Maximum Value Example Active Superstep 0 Superstep 1 Superstep 2 Superstep 3
16 Giraph Page Rank Code Example public class PageRankComputation extends BasicComputation<IntWritable, FloatWritable, NullWritable, FloatWritable> { /** Number of supersteps */ public static final String SUPERSTEP_COUNT = public void compute(vertex<intwritable, FloatWritable, NullWritable> vertex, Iterable<FloatWritable> messages) throws IOException { if (getsuperstep() >= 1) { float sum = 0; for (FloatWritable message : messages) { sum += message.get(); } vertex.getvalue().set((0.15f / gettotalnumvertices()) f * sum); } if (getsuperstep() < getconf().getint(superstep_count, 0)) { sendmessagetoalledges(vertex, new FloatWritable(vertex.getValue().get() / vertex.getnumedges())); } else { vertex.votetohalt(); } } } Deployment and Execution
17 GraphLab (2010) Yucheng Low et al. GraphLab: A New Parallel Framework for Machine Learning. UAI Yucheng Low, et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB Data graph Update functions and the scope Sync operation (similar to aggregation in Pregel)
18 Data Graph
19 Vertex-cut v.s. Edge-cut PowerGraph (2012) Joseph E. Gonzalez et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. OSDI Gather, apply, Scatter (GAS) model GraphX (2013) Reynold Xin et al. GraphX: A Resilient Distributed Graph System on Spark. GRADES (SIGMOD workshop) n/graphx-grades/ Edge-cut (Giraph model) Vertex-cut (GAS model)
20 To reduce communication overhead. Option 1 Algorithmic message reduction Fixed point-to-point communication pattern Option 2 Collective communication optimization Not considered by previous BSP model but well developed in MPI Initial attempts in Twister and Spark on clouds Mosharaf Chowdhury et al. Managing Data Transfers in Computer Clusters with Orchestra. SIGCOMM Bingjing Zhang, Judy Qiu. High Performance Clustering of Social Images in a Map- Collective Programming Model. SOCC Poster 2013.
21 Collective Model Harp (2013) Hadoop Plugin (on Hadoop and Hadoop 2.2.0) Hierarchical data abstraction on arrays, key-values and graphs for easy programming expressiveness. Collective communication model to support various communication operations on the data abstractions. Caching with buffer management for memory allocation required from computation and communication BSP style parallelism Fault tolerance with check-pointing
22 Harp Design Parallelism Model Architecture MapReduce Model Map-Collective Model Application MapReduce Applications Map-Collective Applications M M M M Shuffle M M M M Collective Communication Framework MapReduce V2 Harp R R Resource Manager YARN
23 Hierarchical Data Abstraction and Collective Communication Table Array Table <Array Type> Edge Table Broadcast, Allgather, Allreduce, Regroup-(combine/reduce), Message-to-Vertex, Edge-to-Vertex Message Vertex KeyValue Table Table Table Partition Array Partition < Array Type > Edge Partition Message Partition Vertex Partition KeyValue Partition Broadcast, Send Long Array Int Array Double Array Byte Array Vertices, Edges, Messages Key-Values Basic Types Array Struct Object Broadcast, Send, Gather Commutable
24 Harp Bcast Code Example protected void mapcollective(keyvalreader reader, Context context) throws IOException, InterruptedException { ArrTable<DoubleArray, DoubleArrPlus> table = new ArrTable<DoubleArray, DoubleArrPlus>(0, DoubleArray.class, DoubleArrPlus.class); if (this.ismaster()) { String cfile = conf.get(kmeansconstants.cfile); Map<Integer, DoubleArray> cendatamap = createcendatamap(cparsize, rest, numcenpartitions, vectorsize, this.getresourcepool()); loadcentroids(cendatamap, vectorsize, cfile, conf); addpartitionmaptotable(cendatamap, table); } } arrtablebcast(table);
25 Pipelined Broadcasting with Topology-Awareness Twister vs. MPI (Broadcasting 0.5~2GB data) Twister vs. MPJ (Broadcasting 0.5~2GB data) Number of Nodes Twister Bcast 500MB MPI Bcast 500MB Twister Bcast 1GB MPI Bcast 1GB Twister Bcast 2GB MPI Bcast 2GB Number of Nodes Twister 0.5GB MPJ 0.5GB Twister 1GB MPJ 1GB Twister 2GB Twister vs. Spark (Broadcasting 0.5GB data) Number of Nodes 1 receiver #receivers = #nodes #receivers = #cores (#nodes*8) Twister Chain Twister Chain with/without topology-awareness Number of Nodes 0.5GB 0.5GB W/O TA 1GB 1GB W/O TA 2GB 2GB W/O TA Tested on IU Polar Grid with 1 Gbps Ethernet connection
26 K-Means Clustering Performance on Madrid Cluster (8 nodes) 1600 K-Means Clustering Harp v.s. Hadoop on Madrid Execution Time (s) m m 5k 1m 50k Problem Size Hadoop 24 cores Harp 24 cores Hadoop 48 cores Harp 48 cores Hadoop 96 cores Harp 96 cores
27 K-means Clustering Parallel Efficiency Shantenu Jha et al. A Tale of Two Data- Intensive Paradigms: Applications, Abstractions, and Architectures
28 WDA-MDS Performance on Big Red II WDA-MDS Yang Ruan, Geoffrey Fox. A Robust and Scalable Solution for Interpolative Multidimensional Scaling with Weighting. IEEE e-dcience Big Red II Allgather Bucket algorithm Allreduce Bidirectional exchange algorithm
29 Execution Time of 100k Problem 3000 Execution Time of 100k Problem 2500 Execution Time (Seconds) Number of Nodes (8, 16, 32, 64, 128 nodes, 32 cores per node)
30 Parallel Efficiency Based On 8 Nodes and 256 Cores 1.2 Parallel Efficiency (Based On 8 Nodes and 256 Cores) Number of Nodes (8, 16, 32, 64, 128 nodes) 4096 partitions (32 cores per node)
31 Scale Problem Size (100k, 200k, 300k) 3500 Scaling Problem Size on 128 nodes with 4096 cores Execution Time (Seconds) Problem Size
32 Machine Learning on Big Data Mahout on Hadoop MLlib on Spark GraphLab Toolkits GraphLab Computer Vision Toolkit
33 Query on Big Data Query with procedural language Google Sawzall (2003) Rob Pike et al. Interpreting the Data: Parallel Analysis with Sawzall. Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure Apache Pig (2006) Christopher Olston et al. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD
34 SQL-like Query Apache Hive (2007) Facebook Data Infrastructure Team. Hive - A Warehousing Solution Over a Map- Reduce Framework. VLDB On top of Apache Hadoop Shark (2012) Reynold Xin et al. Shark: SQL and Rich Analytics at Scale. Technical Report. UCB/EECS On top of Apache Spark Apache MRQL (2013) On top of Apache Hadoop, Apache Hama, and Apache Spark
35 Other Tools for Query Apache Tez (2013) To build complex DAG of tasks for Apache Pig and Apache Hive On top of YARN Dremel (2010) Apache Drill (2012) Sergey Melnik et al. Dremel: Interactive Analysis of Web-Scale Datasets. VLDB System for interactive query
36 Stream Data Processing Apache S4 (2011) Apache Storm (2011) Spark Streaming (2012) Apache Samza (2013)
37 REEF Retainable Evaluator Execution Framework Provides system authors with a centralized (pluggable) control flow Embeds a user-defined system controller called the Job Driver Event driven control Package a variety of data-processing libraries (e.g., high-bandwidth shuffle, relational operators, low-latency group communication, etc.) in a reusable form. To cover different models such as MapReduce, query, graph processing and stream data processing
38 Thank You! Questions?
arp: Collective Communication on Hadoop Judy Qiu, Indiana University
arp: Collective Communication on Hadoop Judy Qiu, Indiana University Outline Machine Learning on Big Data Big Data Tools Iterative MapReduce model MDS Demo Harp Machine Learning on Big Data Mahout on Hadoop
HPC ABDS: The Case for an Integrating Apache Big Data Stack
HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox [email protected] http://www.infomall.org
Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing
Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing Andre Luckow, Peter M. Kasson, Shantenu Jha STREAMING 2016, 03/23/2016 RADICAL, Rutgers, http://radical.rutgers.edu
Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de
Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Unified Big Data Analytics Pipeline. 连 城 [email protected]
Unified Big Data Analytics Pipeline 连 城 [email protected] What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an
The State of Big Data: Use Cases and the Ogre patterns
The State of Big Data: Use Cases and the Ogre patterns NIST Big Data Public Working Group IEEE Big Data Workshop October 27, 2014 Geoffrey Fox, Digital Science Center, Indiana University, [email protected]
Introduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
Big Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.
LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD Dr. Buğra Gedik, Ph.D. MOTIVATION Graph data is everywhere Relationships between people, systems, and the nature Interactions between people, systems,
Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
How Companies are! Using Spark
How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made
Apache Hadoop Ecosystem
Apache Hadoop Ecosystem Rim Moussa ZENITH Team Inria Sophia Antipolis DataScale project [email protected] Context *large scale systems Response time (RIUD ops: one hit, OLTP) Time Processing (analytics:
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team
Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris
Dell In-Memory Appliance for Cloudera Enterprise
Dell In-Memory Appliance for Cloudera Enterprise Hadoop Overview, Customer Evolution and Dell In-Memory Product Details Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert [email protected]/
Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
From GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu [email protected] MapReduce/Hadoop
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Spark. Fast, Interactive, Language- Integrated Cluster Computing
Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC
MapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues
Large-Scale Data Processing
Large-Scale Data Processing Eiko Yoneki [email protected] http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase
Big Data Analytics Hadoop and Spark
Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software
Beyond Hadoop with Apache Spark and BDAS
Beyond Hadoop with Apache Spark and BDAS Khanderao Kand Principal Technologist, Guavus 12 April GITPRO World 2014 Palo Alto, CA Credit: Some stajsjcs and content came from presentajons from publicly shared
Beyond Batch Processing: Towards Real-Time and Streaming Big Data
Beyond Batch Processing: Towards Real-Time and Streaming Big Data Saeed Shahrivari, and Saeed Jalili Computer Engineering Department, Tarbiat Modares University (TMU), Tehran, Iran [email protected],
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
How To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5
Big Data Visualization using Apache Spark and Zeppelin Prajod Vettiyattil, Software Architect, Wipro Agenda Big Data and Ecosystem tools Apache Spark Apache Zeppelin Data Visualization Combining Spark
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to
Second Credit Seminar Presentation on Big Data Analytics Platforms: A Survey
Second Credit Seminar Presentation on Big Data Analytics Platforms: A Survey By, Mr. Brijesh B. Mehta Admission No.: D14CO002 Supervised By, Dr. Udai Pratap Rao Computer Engineering Department S. V. National
NoSQL Data Base Basics
NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
YARN Apache Hadoop Next Generation Compute Platform
YARN Apache Hadoop Next Generation Compute Platform Bikas Saha @bikassaha Hortonworks Inc. 2013 Page 1 Apache Hadoop & YARN Apache Hadoop De facto Big Data open source platform Running for about 5 years
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
Pla7orms for Big Data Management and Analysis. Michael J. Carey Informa(on Systems Group UCI CS Department
Pla7orms for Big Data Management and Analysis Michael J. Carey Informa(on Systems Group UCI CS Department Outline Big Data Pla6orm Space The Big Data Era Brief History of Data Pla6orms Dominant Pla6orms
Managing large clusters resources
Managing large clusters resources ID2210 Gautier Berthou (SICS) Big Processing with No Locality Job( /crawler/bot/jd.io/1 ) submi t Workflow Manager Compute Grid Node Job This doesn t scale. Bandwidth
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Big Data and Scripting Systems beyond Hadoop
Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid
Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone
Hadoop2, Spark Big Data, real time, machine learning & use cases Cédric Carbone Twitter : @carbone Agenda Map Reduce Hadoop v1 limits Hadoop v2 and YARN Apache Spark Streaming : Spark vs Storm Machine
CSE-E5430 Scalable Cloud Computing Lecture 11
CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 30.11-2015 1/24 Distributed Coordination Systems Consensus
Deploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer [email protected] Alejandro Bonilla / Sales Engineer [email protected] 2 Hadoop Core Components 3 Typical Hadoop Distribution
extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010
System/ Scale to Primary Secondary Joins/ Integrity Language/ Data Year Paper 1000s Index Indexes Transactions Analytics Constraints Views Algebra model my label 1971 RDBMS O tables sql-like 2003 memcached
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Scalable Cloud Computing
Keijo Heljanko Department of Information and Computer Science School of Science Aalto University [email protected] 1/44 Business Drivers of Cloud Computing Large data centers allow for economics
6.S897 Large-Scale Systems
6.S897 Large-Scale Systems Instructor: Matei Zaharia" Fall 2015, TR 2:30-4, 34-301 bit.ly/6-s897 Outline What this course is about" " Logistics" " Datacenter environment What this Course is About Large-scale
Case Study : 3 different hadoop cluster deployments
Case Study : 3 different hadoop cluster deployments Lee moon soo [email protected] HDFS as a Storage Last 4 years, our HDFS clusters, stored Customer 1500 TB+ data safely served 375,000 TB+ data to customer
Moving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform Amir H. Payberah Swedish Institute of Computer Science [email protected] June 4, 2014 Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44 Big Data small data
Upcoming Announcements
Enterprise Hadoop Enterprise Hadoop Jeff Markham Technical Director, APAC [email protected] Page 1 Upcoming Announcements April 2 Hortonworks Platform 2.1 A continued focus on innovation within
Introduction to Big Data Training
Introduction to Big Data Training The quickest way to be introduce with NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB
Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics
Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Sabeur Aridhi Aalto University, Finland Sabeur Aridhi Frameworks for Big Data Analytics 1 / 59 Introduction Contents 1 Introduction
Map-Based Graph Analysis on MapReduce
2013 IEEE International Conference on Big Data Map-Based Graph Analysis on MapReduce Upa Gupta, Leonidas Fegaras University of Texas at Arlington, CSE Arlington, TX 76019 {upa.gupta,fegaras}@uta.edu Abstract
Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013
Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software SC13, November, 2013 Agenda Abstract Opportunity: HPC Adoption of Big Data Analytics on Apache
Workshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah
Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang
The Big Data Ecosystem at LinkedIn Presented by Zhongfang Zhuang Based on the paper The Big Data Ecosystem at LinkedIn, written by Roshan Sumbaly, Jay Kreps, and Sam Shah. The Ecosystems Hadoop Ecosystem
Survey of Apache Big Data Stack
Survey of Apache Big Data Stack Supun Kamburugamuve For the PhD Qualifying Exam 12/16/2013 Advisory Committee Prof. Geoffrey Fox Prof. David Leake Prof. Judy Qiu Table of Contents 1. Introduction... 3
Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
HDP Hadoop From concept to deployment.
HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some
International Journal of Innovative Research in Information Security (IJIRIS) ISSN: 2349-7017(O) Volume 1 Issue 3 (September 2014)
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE N.Alamelu Menaka * Department of Computer Applications Dr.Jabasheela Department of Computer Applications Abstract-We are in the age of big data which
Why Spark on Hadoop Matters
Why Spark on Hadoop Matters MC Srivas, CTO and Founder, MapR Technologies Apache Spark Summit - July 1, 2014 1 MapR Overview Top Ranked Exponential Growth 500+ Customers Cloud Leaders 3X bookings Q1 13
Cloud Scale Distributed Data Storage. Jürmo Mehine
Cloud Scale Distributed Data Storage Jürmo Mehine 2014 Outline Background Relational model Database scaling Keys, values and aggregates The NoSQL landscape Non-relational data models Key-value Document-oriented
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving
Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University 2013-03-05
Introduction to NoSQL Databases Tore Risch Information Technology Uppsala University 2013-03-05 UDBL Tore Risch Uppsala University, Sweden Evolution of DBMS technology Distributed databases SQL 1960 1970
Lecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu
Lecture 4 Introduction to Hadoop & GAE Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Outline Introduction to Hadoop The Hadoop ecosystem Related projects
HADOOP. Revised 10/19/2015
HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...
Ali Ghodsi Head of PM and Engineering Databricks
Making Big Data Simple Ali Ghodsi Head of PM and Engineering Databricks Big Data is Hard: A Big Data Project Tasks Tasks Build a Hadoop cluster Challenges Clusters hard to setup and manage Build a data
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
Big Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA
Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,
Twister4Azure: Data Analytics in the Cloud
Twister4Azure: Data Analytics in the Cloud Thilina Gunarathne, Xiaoming Gao and Judy Qiu, Indiana University Genome-scale data provided by next generation sequencing (NGS) has made it possible to identify
How to Hadoop Without the Worry: Protecting Big Data at Scale
How to Hadoop Without the Worry: Protecting Big Data at Scale SESSION ID: CDS-W06 Davi Ottenheimer Senior Director of Trust EMC Corporation @daviottenheimer Big Data Trust. Redefined Transparency Relevance
Apache Spark and Distributed Programming
Apache Spark and Distributed Programming Concurrent Programming Keijo Heljanko Department of Computer Science University School of Science November 25th, 2015 Slides by Keijo Heljanko Apache Spark Apache
Architectures for massive data management
Architectures for massive data management Apache Spark Albert Bifet [email protected] October 20, 2015 Spark Motivation Apache Spark Figure: IBM and Apache Spark What is Apache Spark Apache
Introduction to Hadoop
Introduction to Hadoop ID2210 Jim Dowling Large Scale Distributed Computing In #Nodes - BitTorrent (millions) - Peer-to-Peer In #Instructions/sec - Teraflops, Petaflops, Exascale - Super-Computing In #Bytes
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
Dominik Wagenknecht Accenture
Dominik Wagenknecht Accenture Improving Mainframe Performance with Hadoop October 17, 2014 Organizers General Partner Top Media Partner Media Partner Supporters About me Dominik Wagenknecht Accenture Vienna
Machine- Learning Summer School - 2015
Machine- Learning Summer School - 2015 Big Data Programming David Franke Vast.com hbp://www.cs.utexas.edu/~dfranke/ Goals for Today Issues to address when you have big data Understand two popular big data
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Apache Spark Apache Spark 1
HiBench Introduction. Carson Wang ([email protected]) Software & Services Group
HiBench Introduction Carson Wang ([email protected]) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
