An Open Source Memory-Centric Distributed Storage System
Transcription
1 An Open Source Memory-Centric Distributed Storage System. Haoyuan Li, Tachyon Nexus. September 30, Strata and Hadoop World NYC 2015
2 Outline: Open Source; Introduction to Tachyon; New Features; Getting Involved
3 Outline: Open Source; Introduction to Tachyon; New Features; Getting Involved
4 History: Started at UC Berkeley AMPLab in summer 2012 (the same lab produced Apache Spark and Apache Mesos). Open sourced in April 2013 under Apache License 2.0. Latest release: Version (August 2015). Deployed at > 100 companies.
5 Contributors Growth [chart; data label: 111]. Releases: v0.1 (Dec 2012), v0.2 (Apr 2013), v0.3 (Oct 2013), v0.4 (Feb 2014), v0.5 (Jul 2014), v0.6 (Mar 2015), v0.7 (Jul 2015)
6 Contributors Growth: > 150 contributors (a 3x increase since the last Strata NYC); > 50 organizations
7 Contributors Growth: One of the fastest growing Big Data open source projects
8 Thanks to Contributors and Users!
9 One Tachyon Production Deployment Example: Baidu (dominant search engine in China, ~50 billion USD market cap). Framework: SparkSQL. Under storage: Baidu's File System. Storage media: MEM + HDD. 100+ node deployment; 1PB+ managed space; 30x performance improvement.
10 Outline: Open Source; Introduction to Tachyon; New Features; Getting Involved
11 Tachyon is an Open Source Memory-centric Distributed Storage System
12 Why Tachyon?
13 Performance Trend: Memory is Fast. RAM throughput is increasing exponentially; disk throughput is increasing slowly; memory locality is key to interactive response times.
14 Price Trend: Memory is Cheaper (source: jcmit.com)
15 Realized by many
16 Is the Problem Solved?
17 Missing a Solution for the Storage Layer
18 A Use Case Example with Spark: a fast, in-memory data processing framework. It keeps one in-memory copy inside the JVM, tracks the lineage of operations used to derive data, and upon failure uses that lineage to recompute the data. [Diagram: Lineage Tracking across map, join, reduce, filter, map.]
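The lineage idea on this slide can be illustrated with a small sketch. This is a toy model, not Spark's RDD API or Tachyon's implementation: the class `LineageDataset` and the method `lose_cache` are hypothetical names invented here. Each derived dataset remembers its parent and the operation that produced it, so a lost in-memory copy can be rebuilt by replaying the chain.

```python
class LineageDataset:
    """Toy dataset that records how it was derived (hypothetical, not Spark's API)."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # raw rows, only set on the root dataset
        self._parent = parent        # dataset this one was derived from
        self._transform = transform  # function: parent rows -> this dataset's rows
        self._cache = None           # the single in-memory copy

    def map(self, fn):
        # Record the operation instead of running it right away.
        return LineageDataset(parent=self,
                              transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LineageDataset(parent=self,
                              transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # If the cached copy is missing, replay the lineage to rebuild it.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return list(self._cache)

    def lose_cache(self):
        # Simulate a failure that wipes the in-memory copy.
        self._cache = None


if __name__ == "__main__":
    base = LineageDataset(source=[1, 2, 3, 4])
    derived = base.map(lambda x: x * 10).filter(lambda x: x > 10)
    print(derived.collect())   # computed from lineage
    derived.lose_cache()       # simulated crash
    print(derived.collect())   # recomputed from lineage, same rows
```

In this sketch the recorded operations are the only recovery mechanism, which is why no second copy of the data is needed.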
19 Issue 1: Data sharing is the bottleneck in the analytics pipeline: slow writes to disk. Storage engine and execution engine in the same process (slow writes). [Diagram: Spark Job1 and Spark Job2, each with a Spark memory block manager holding blocks 1 and 3; HDFS / Amazon S3 holding blocks 1-4.]
20 Issue 1: Data sharing is the bottleneck in the analytics pipeline: slow writes to disk. Storage engine and execution engine in the same process (slow writes). [Diagram: a Spark job with a Spark memory block manager holding blocks 1 and 3, and a Hadoop MR job on YARN; HDFS / Amazon S3 holding blocks 1-4.]
21 Issue 1 resolved with Tachyon: Memory-speed data sharing among jobs in different frameworks. Execution engine and storage engine in the same process (fast writes). [Diagram: a Spark job (Spark mem) and a Hadoop MR job (YARN) above Tachyon, which holds blocks in memory; blocks 1-4 on disk in HDFS / Amazon S3.]
22 Issue 2: Cache loss when the process crashes. Execution engine and storage engine in the same process. [Diagram: a Spark task with a Spark memory block manager holding blocks 1 and 3; HDFS / Amazon S3 holding blocks 1-4.]
23 Issue 2: Cache loss when the process crashes. Execution engine and storage engine in the same process. [Diagram: the task is replaced by "crash"; the Spark memory block manager with blocks 1 and 3 is still shown; HDFS / Amazon S3 holding blocks 1-4.]
24 Issue 2: Cache loss when the process crashes. Execution engine and storage engine in the same process. [Diagram: after the crash only blocks 1-4 in HDFS / Amazon S3 are shown.]
25 Issue 2 resolved with Tachyon: Keep in-memory data safe, even when a job crashes. Execution engine and storage engine in the same process. [Diagram: a Spark task with a Spark memory block manager; blocks 1-4 in memory in Tachyon; HDFS / Amazon S3.]
26 Issue 2 resolved with Tachyon: Keep in-memory data safe, even when a job crashes. Execution engine and storage engine in the same process. [Diagram: the task is replaced by "crash"; blocks remain in memory in Tachyon; blocks 1-4 on disk in HDFS / Amazon S3.]
27 Issue 3: In-memory data duplication and Java garbage collection. Execution engine and storage engine in the same process (duplication & GC). [Diagram: Spark Job1 and Spark Job2, each with a Spark memory block manager holding its own copies of blocks 1 and 3; HDFS / Amazon S3 holding blocks 1-4.]
28 Issue 3 resolved with Tachyon: No in-memory data duplication, much less GC. Execution engine and storage engine in the same process (no duplication & GC). [Diagram: Spark Job1 and Spark Job2 (Spark mem) above Tachyon, which holds blocks in memory; blocks 1-4 on disk in HDFS / Amazon S3.]
29 Previously Mentioned: A memory-centric storage architecture; push lineage down to the storage layer
30 Tachyon Memory-Centric Architecture
31 Tachyon Memory-Centric Architecture
32 Lineage in Tachyon
33 Outline: Open Source; Introduction to Tachyon; New Features; Getting Involved
34 1) Ecosystem: Enable new workloads in any storage; work with the framework of your choice.
35 2) Tachyon running in production environments, both in the cloud and on premises.
36 Use Case: Baidu. Framework: SparkSQL. Under storage: Baidu's File System. Storage media: MEM + HDD. 100+ node deployment; 1PB+ managed space; 30x performance improvement.
37 Use Case: a SaaS Company. Framework: Impala. Under storage: S3. Storage media: MEM + SSD. 15x performance improvement.
38 Use Case: an Oil Company. Framework: Spark. Under storage: GlusterFS. Storage media: MEM only. Analyzing data in traditional storage.
39 Use Case: a SaaS Company. Framework: Spark. Under storage: S3. Storage media: SSD only. Elastic Tachyon deployment.
40 What if data size exceeds memory capacity?
41 3) Tiered Storage: Tachyon Manages More Than DRAM. [Diagram: tiers MEM, SSD, HDD; faster toward MEM, higher capacity toward HDD.]
42 Configurable Storage Tiers: MEM only; MEM + HDD; SSD only
43 4) Pluggable Data Management Policy: Promote hot data to an upper tier; evict stale data to a lower tier.
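The promote-and-evict behaviour named on slides 41 to 43 can be sketched with a small in-process model. This is an illustration under stated assumptions, not Tachyon's code or configuration: `TieredStore`, `tier_of`, the per-tier block capacities, and the least-recently-used choice of what counts as "stale" are all hypothetical. Reads promote a block to the top tier, and a full tier pushes its least recently used block one tier down.

```python
from collections import OrderedDict


class TieredStore:
    """Toy tiered block store (hypothetical; LRU is an assumed policy)."""

    def __init__(self, tiers):
        # tiers: list of (name, max_blocks), fastest tier first,
        # e.g. [("MEM", 2), ("SSD", 2), ("HDD", 2)]
        self.names = [name for name, _ in tiers]
        self.capacities = [cap for _, cap in tiers]
        self.blocks = [OrderedDict() for _ in tiers]

    def _place(self, level, block_id, data):
        if level >= len(self.blocks):
            return  # fell below the lowest tier: dropped in this sketch
        tier = self.blocks[level]
        tier[block_id] = data
        tier.move_to_end(block_id)  # mark as most recently used
        while len(tier) > self.capacities[level]:
            # Evict the stalest block to the next lower tier.
            stale_id, stale_data = tier.popitem(last=False)
            self._place(level + 1, stale_id, stale_data)

    def write(self, block_id, data):
        for tier in self.blocks:
            tier.pop(block_id, None)  # drop any older copy first
        self._place(0, block_id, data)

    def read(self, block_id):
        for tier in self.blocks:
            if block_id in tier:
                data = tier.pop(block_id)
                self._place(0, block_id, data)  # promote hot data to the top tier
                return data
        return None

    def tier_of(self, block_id):
        for name, tier in zip(self.names, self.blocks):
            if block_id in tier:
                return name
        return None


if __name__ == "__main__":
    store = TieredStore([("MEM", 2), ("SSD", 2), ("HDD", 2)])
    for i in (1, 2, 3):
        store.write(f"block{i}", b"data")
    print(store.tier_of("block1"))  # evicted from MEM when block3 arrived
    store.read("block1")
    print(store.tier_of("block1"))  # promoted back after the read
```

Swapping the eviction rule inside `_place` for another one is what a pluggable policy would amount to in this model.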
44 Pin Data in Memory
45 5) Transparent Naming
46 6) Unified Namespace
47 More Features: 7) Remote write support; 8) Easy deployment with Mesos and YARN; 9) Initial security support; 10) One-command cluster deployment; 11) Metrics reporting for clients, workers, and master
48 12) More Under Storage Support
49 Reported Tachyon Usage
50 Outline: Open Source; Introduction to Tachyon; New Features; Getting Involved
51 Memory-Centric Distributed Storage: Welcome to try, contact, and collaborate! JIRA New Contributor Tasks
52 Team consists of Tachyon creators and top contributors. Series A ($7.5 million) from Andreessen Horowitz. Committed to Tachyon Open Source.
53
54 Strata NYC 2015: Welcome to visit us at our booth #P18. Check out other Tachyon-related talks: "First-ever scalable, distributed deep learning architecture using Spark and Tachyon", Christopher Nguyen (Adatao, Inc.), Vu Pham (Adatao, Inc.), 2:05pm-2:45pm Thursday, 10/01/2015; "Faster time to insight using Spark, Tachyon, and Zeppelin", Nirmal Ranganathan (Rackspace Hosting), 2:05pm-2:45pm Thursday, 10/01/
55 Try Tachyon: Develop Tachyon: Meet Friends: Get News: Tachyon Nexus: Contact us: