GS Big Data Platform



Similar documents
Data Governance in the Hadoop Data Lake. Kiran Kamreddy May 2015

How Companies are! Using Spark

HDP Hadoop From concept to deployment.

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Luncheon Webinar Series May 13, 2013

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Big Data Management and Security

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

How to Hadoop Without the Worry: Protecting Big Data at Scale

Hadoop implementation of MapReduce computational model. Ján Vaňo

Data Governance in the Hadoop Data Lake. Michael Lang May 2015

Hadoop IST 734 SS CHUNG

Dominik Wagenknecht Accenture

Data Security in Hadoop

Implement Hadoop jobs to extract business value from large and varied data sets

Large scale processing using Hadoop. Ján Vaňo

Information Architecture

Navigating the Big Data infrastructure layer Helena Schwenk

Hadoop Big Data for Processing Data and Performing Workload

Hadoop & Spark Using Amazon EMR

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

June Production Hadoop systems in the enterprise

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Native Connectivity to Big Data Sources in MSTR 10

Hadoop and Map-Reduce. Swati Gore

Hadoop. Sunday, November 25, 12

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

How To Scale Out Of A Nosql Database

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

Big Data. White Paper. Big Data Executive Overview WP-BD Jafar Shunnar & Dan Raver. Page 1 Last Updated

SQL Server 2012 Parallel Data Warehouse. Solution Brief

The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Dell In-Memory Appliance for Cloudera Enterprise

Please give me your feedback

CitusDB Architecture for Real-Time Big Data

Upcoming Announcements

I/O Considerations in Big Data Analytics

The Internet of Things and Big Data: Intro

BIG DATA TRENDS AND TECHNOLOGIES

How To Handle Big Data With A Data Scientist

HYPER-CONVERGED INFRASTRUCTURE STRATEGIES

Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack

Hadoop Ecosystem B Y R A H I M A.

Big Data: Making Sense of it all!

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Chase Wu New Jersey Ins0tute of Technology

White Paper: What You Need To Know About Hadoop

Trafodion Operational SQL-on-Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

HDP Enabling the Modern Data Architecture

Apache Hadoop: The Big Data Refinery

INDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases

NoSQL for SQL Professionals William McKnight

Big data for the Masses The Unique Challenge of Big Data Integration

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Hadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010

Ubuntu and Hadoop: the perfect match

The Future of Data Management

Big Data and Apache Hadoop Adoption:

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Katta & Hadoop. Katta - Distributed Lucene Index in Production. Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com

Choosing The Right Big Data Tools For The Job A Polyglot Approach

Testing 3Vs (Volume, Variety and Velocity) of Big Data

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

BIG DATA What it is and how to use?

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Modern Data Architecture for Predictive Analytics

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Comprehensive Analytics on the Hortonworks Data Platform

#TalendSandbox for Big Data

A Brief Introduction to Apache Tez

Agenda. Big Data & Hadoop ViPR HDFS Pivotal Big Data Suite & ViPR HDFS ViON Customer Feedback #EMCVIPR

Big Data Technologies Compared June 2014

Data processing goes big

Certified Big Data and Apache Hadoop Developer VS-1221

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Evolution from Big Data to Smart Data

Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale

Workshop on Hadoop with Big Data

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Data Modeling for Big Data

Transcription:

GS Big Data Platform

DataPhilosophy 1 Instrument everything 2 Put all data in one place 3 Data first, questions later 4 Store first, structure later 5 Let everyone party on the data (with controls) 6 Keep raw data forever 7 Produce tools to support the whole research cycle 8 Modular and composable infrastructure 2

DistributedSystems;DistributedData DistributedSystems;DistributedData Our Big smalldataproblem Highlyfunc.onallyaligned systems ExcellentDataSegrega.on LocalDataAutonomy LocalGovernance&reten.on Locallynego.ateddataevolu.on Extensiveabuseofdata movementtechnologies SharedData (referencedata)is broadlydisseminated,butmostly fromcentralloca.ons Event data(transac.ons)flow acrosssystemsandpersistedat eachstage Werarelyusedcentralized sharedserviceslikethereference datafarm OurDataisan Asset (andshould betreatedassuch( 3

BigDataPla:ormGoals Createa GSDataLake toallowformanydatasetstocoexistandbe availablewhichisexternaltoanyspecificgsapplicafon. Crea.ngadataregistrytostorethedatasetmetadataandallowfordatasetstobe discoveredandused. Createafacilitytoproperlyen.tleaccesstothedatasets(thatcodeistypically customlogicembeddeddirectlywithineachapplica.on) Createthefacili.estoingestthedataandprovideresiliency. BuildanintegratedsoRwarestackusingmul.pledatamanagementsoRware productstoprovidethefullsuiteoffunc.onrequiredforthedatalake. Thepla:ormwillbecomposedofmulFpleproductsintegratedand madeconsistentbygsdevelopedinfrastructure. HDFS/Hivefordeeppetabytescaleonlinearchive MPPColumnXstoreParAccelforhighperformanceaggrega.on/pivo.ng GraphdatabasefornonXrela.onalqueriesandseman.csearch Textsearchfunc.onforunstructuredorsemiXstructureddatasets Metadataregistryanden.tlementmodel Pleasesendyourques.onstoLiveCurrent@gs.comwiththesubject#spark 4

TheevoluFonofscaleIoutdatamanagementpla:orms PerhapsthesinglebiggestfactorinenablingBigDataistherapid innovafonoccurringforthe scaleiout ofdatamanagementpla:orms. DBMSrunsonsinglehostonly(tradi.onalRDBMS) DBMSrunsonmul.plehosts,singlecopyofdata,sta.callypar..oned(DB2DPF) DBMSrunsonmul.plehosts,twocopiesofdata,sta.callypar..oned(ParAccel) DBMSrunsonmul.plehosts,manycopiesofdata,sta.callypar..oned(MongoDB) DBMSrunsonmul.plehosts,manycopiesofdata,dynamicpar.oning(HXBase) 5

BigDataPla:ormDesiredProperFes Scalable No fixed upper bound to ultimate dataset size. Storage and CPU capacity must be able to be increased in an incremental and linear fashion. Platform technology stack should already be in use at larger scale than GS use-case. Affordable Technology hardware stack should be based on a scale-out of commodity components. Technology software stack should be based on open source projects. Platform should be designed to run on GS Dynamic Compute nodes Vendor lock-in for any unique portion of the platform should be avoided when possible. Operating cost of platform should be kept to a minimum via low touch infra-structure that self manages. Trusted Entire be resilient to individual component failure not requiring any manual intervention Must be easy to both self heal from failure and to scale-out additional capacity Must have facilities to allow for authentication, security and access entitlement 6

BigDataPla:ormOngoingResearch(1) EnFtlementforBigDatapla:ormdatasets Two different entitlement problems to be solved for: How to model the entitlement rules on who should be able to see what data. How to implement those rules within the platform. Products such as sqrrl and Accumulo are being looked at to provide the fine grained access control. Alternatively the rules could be implemented in a GS access layer software BigGraph Graph databases can be powerful, allowing for queries that are difficult to express in SQL. Graph databases do not easily lend themselves to data shard ing and scale-out. YarcData Urika product is being looked at for high performance Big Graph solution. Aurelius Titan graph database also being tested. TextSearchDataStore/SemanFcSearchDataStore Entitling data stored in an unstructured or semi-structured manner poses new challenges Elasticsearch product is used in several different applications within GS Attivio product is also in use at GS 7

BigDataPla:ormOngoingResearch(2) BigDatapla:ormIdatamovement Information loaded to the Big Data platform should be considered immutable. Data fed into the Big Data platform will need to be stored identically on multiple clusters. Gigabus will be instrumental in creating serialized streams of data across the Big Data platform Any new data created on the Big Data platform will need to be streamed back into the platform BigDatapla:orm dataretenfon Traditional concepts such as a database backup or transaction log need to be completely rethought for the Big Data platform. Forcing all data through a product with data retention such as Gigabus should be enforced. All products that feed data unto the Big Data platform should have a method of replaying datasets unto the platform on request. BigDatapla:orm workloadmanagement All things being equal a fewer number of clusters is preferable to a greater number of clusters. YARN and other technologies are being tested to understand workload management functions Hadoop data federation technologies are being tested to bridge multiple products. 8

BigDataProductStatusQ22014 GSBigDataCatalog Runs on standard Dynamic Compute nodes. Utilizes CKAN open source metadata repository application and UI. Metadata is externalized in Google DSPL format. Entire registry stack can be extended for GS specific requirements. HortonworksHadoop2.06 Runs on Dynamic Compute large storage nodes. Major engineering effort underway to have full Kerberos integration. Standard monitoring to Fabric with 24x7 support team. HDFS file based technologies such as M-R, PIG and Hive currently used in production. H-Base key value database currently used in production. Site resiliency and data retention will not be provided via the Hadoop stack ParAccelRelaFonalDBMS Runs on Dynamic Compute large storage nodes. Mature high performance MPP columnar RDBMS. Cluster is inelastic and does not keep multiple copies of data. 9

Appendix 10

UsingR:Sta.s.calDataAnalysis Youmayhaveheardof. PredicFveAnalyFcs,DataMining,DataAnalysis,StaFsFcalAnalysis,DataVisualizaFon, BusinessIntelligence,BigData Whousesit?(whodoesn t?) Google,Facebook,DoubleIClick,LinkedIn CreditCardCompanies,InsuranceCompanies,Finance(ofcourse) Anywhereyouwanttoextractvaluefromyourdata DevelopmentPaeerns DataVisualizaFon Graphingsta.s.calsummariesofdatatogaininsights ModelingandPredicFon Modelthesystemusingsta.s.calmodels,thenusethosemodelstochecknewdata Datavisualiza.onusedtounderstandhowthemodelperforms RisagreatwayforprogrammerstodostaFsFcs 11