Big Data Trends and Best Practices. Peter Linnell Big Data Team @ SUSE Apache Bigtop PMC plinnell@suse.com plinnell@apache.org



A little bit about me
- Scribus Founder and Core Team Member since 2001
- Ex-Cloudera Kitchen Team
- openSUSE Community member since 2006
- openSUSE Board Member
- Apache Bigtop Founder and PMC
- Packager and contributor for many Open Source apps
- Day Job: SUSE Systems Engineer in Silicon Valley
- HPC/Big Iron Fan

Dilbert on Big Data

Hype Cycle

Linux is the Foundation for Big Data
- Scale
- Low Cost
- Commodity Hardware
- No Lock In
- Coopetition


Big Data: The Jargon List
- Hadoop Core: Hadoop is a Data Operating System.
- Apache Hadoop is an open source software ecosystem, built around the core Hadoop technology.
- NoSQL: A way of storing data, mostly in memory, for quickly searching for data.
- Data has a temperature:
  - Cold: Data stored nearby
  - Hot / Fast: In memory or intelligent caching
- Live Data: Accessible to Big Data tools
- Dead Data = Offline Data
- ACID: Atomicity, Consistency, Isolation, Durability
- Sharding: see Wikipedia, it is too complicated :-)

Big Data Challenges
- Existing data workflows are siloed
- Data is siloed: formats, proprietary applications
- Sensitive Data Concerns
- Regulatory Blockages
- Budget Constraints
- Planning Lead Times

Big Data Challenges
- Data scrubbing is the step that is never mentioned, but it can be one of the biggest challenges.
- Big Data likes memory, aka storage.
- Jobs can run longer than some typical mainframe or batch jobs.
- Hadoop turns the computing notion of bringing data to processing power on its head. You bring the compute power to where the data resides.

Big Data Advantages on Mainframes
- Many Big Data solutions are batch oriented and thus work well within a shared environment.
- Much of the Big Data platform is open source. No vendor lock-in. See Apache Bigtop.
- Enables new and innovative services for end users, as well as better informed decision making.

Big Data Advantages on Mainframe
- 95% of the Hadoop ecosystem is written in Java (this is good and bad!)
- Hadoop stacks are typically not compute bound, but I/O bound. Guess what platform excels here?
- Lots of valuable corporate and government data resides on mainframes.

Examples of Big Data volumes
- Scientific measurements (e.g. particle collision results from the Large Hadron Collider at CERN)
- Financial data such as stock information, share-price statistical data, stock-related press coverage, etc.
- Medical data: genome databases, patient files in hospitals, information about pharmaceuticals
- Indexed web or social media content
- Environmental records: weather
- Web server access logs
- Sales data

Five main use cases for Big Data
- Transparency: insights into ongoing business operations
- Decision-testing: What happened (will happen) when (if) we made (make) this decision?
- Individualization in real time: tailoring offerings and services to customer wishes in real time in order to increase customer satisfaction and reduce customer churn
- Intelligent process control and automation
- Innovative data-driven business models
From Big Data in Action - http://en.sap.info/big-data-in-action/82754?source=email-en-sapinfonewsletter-20121204

How to distinguish between several kinds of Big Data?
- Amount of data: large (n terabytes), very large (n petabytes) or gigantic (n exabytes)?
- Structured data (e.g. relational, column separated) or unstructured data (e.g. documents, web pages)?
- How complex is the data model?
- Transactional or non-transactional? Is full data integrity (ACID) required?
- Usage patterns: just lots of reads, or also many inserts, updates and deletions?
- Usage performance: real time, short delays, long delays?
- A combination of several of the questions above

Hadoop vs SQL (RDBMS)
Hadoop                    SQL (RDBMS)
No predefined schema      Schema defined in advance
Fast Loading              Data transformed
Simpler Data Structures   Fast Reading
Flexible and Agile        Standards/Governance
The real innovation is the capability to explore original raw data

When to pick Hadoop vs RDBMS
- Scalability is important
- Speed is important
- Structured or Unstructured
- ACID Transactions
- Interactive Analytics
- Complex Data Process
A sports car is faster, but a truck can carry more.

Apache Hadoop Strengths
- Huge data volumes
- Unstructured data
- Reliable
- Scalable
- Lowest cost
- Open source
- No hardware lock in
- Batch processing

Apache Hadoop Weaknesses
- Not very efficient at small scale
- Real time is challenging at the moment (WIP)
- Requires skilled engineers and operations
- Less mature than SQL
- Weakly defined user roles in data access model (WIP)

What About NoSQL/NewSQL?
- Can be a cost effective replacement or supplement for traditional proprietary databases.
- There are several, e.g. MongoDB, Accumulo and Cassandra, each trying to solve different problems.
- Each has strengths and weaknesses to evaluate.

Linux Challenges
- Scalability: We're hitting the limits of physics with current technology.
- The need for better fault tolerance in the O/S. Now helped by live kernel patching in Linux 4.1.
- The future will bring us exascale (10^18) challenges. Think 3-7 years down the road.
- Java scalability? Stutter affects Hadoop.

Emerging Trends in Big Data
- Streaming: accessing data in near real time for capture and analysis.
- Fast Data: in memory or intelligent caching, e.g. Spark, SAP HANA, HP Haven.
- Connectors are becoming ubiquitous.
- Machine learning is becoming more accessible.
- Despite performance degradation, Cloud is becoming a more usable option for production.

Evaluation Thoughts
- Is Big Data a solution in search of a problem?
- Evaluate the need for real time data vs. near real time.
- Do we have the right questions to ask?
- How can Big Data workflows be integrated with our existing infrastructure?
- What other agencies might have useful data?
- Pilot Pilot Pilot...


SUSE Big Data Lab
Big Data Cluster in Provo, UT for:
- Benchmarking
- Software certification
- Integration / test
- Reference architectures
- Demo system
- Remotely accessible

SUSE Big Data Partner Ecosystem
- Integrated solutions: SAP HANA, Teradata Aster Big Analytics Appliance
- Hadoop Distributions: Intel, Cloudera, Hortonworks, WANdisco
- Database: Intersystems CACHÉ

Bigtop
- Packaging, QA testing and integration stack for Apache Hadoop components
- Made up of engineers from most of the Hadoop distros: Cloudera, Hortonworks and WANdisco, along with SUSE and independent contributors
- Almost unique among Apache projects in that its goal is to integrate other projects
- All major Hadoop distros base their product on Bigtop

Why SUSE for Big Data?
- SUSE has a decade plus of leadership in HPC/Supercomputing for Linux. Est. 50% of the Top 500. Titan, the biggest, runs SLES.
- SLES 11 SP3 has the most modern optimized kernel for Big Data workloads.
- We have Tier 1 support and relationships with all major open source Hadoop distributors.
- Competition sees Big Data as an opportunity to sell proprietary solutions.
- We care about this market.

Questions? bigdata@suse.com

Corporate Headquarters: Maxfeldstrasse 5, 90409 Nuremberg, Germany. +49 911 740 53 0 (Worldwide). www.suse.com www.opensuse.org

Appendix

Hadoop Core Components

Typical Hadoop Distribution

How Hadoop Works at Its Core
[Diagram: a client sends metadata ops to the NameNode, which holds the metadata (name, replicas, ...), e.g. /home/foo/data, 3, ...; clients read and write blocks on DataNodes in Rack 1 and Rack 2 through block ops, and blocks are replicated between DataNodes.]

Hadoop is only one part, but an important part
- The compute layer of big data
- Supports the running of applications on large clusters of commodity hardware.
- Provides a distributed file system (HDFS) that stores data on the compute nodes.
- Enables applications to work with thousands of computers and petabytes of data.
- Lots of momentum: IBM, Microsoft, Oracle, SAP, EMC, HP and Teradata have built solutions on Hadoop or at least connectors to Hadoop
- Ecosystem of Hadoop players: Intel, Cloudera, Hortonworks, WANdisco, MapR, Greenplum
- Apache support

NameNode
- The NameNode (NN) stores all metadata:
  - Information about file locations in HDFS
  - Information about file ownership and permissions
  - Names of the individual blocks
  - Location of the blocks
- Metadata is stored on disk and read when the NameNode daemon starts

NameNode (2)
- File name is fsimage
- Block locations are not stored in fsimage
- Changes to the metadata are made in RAM
- Changes are also written to a log file on disk called edits
- Each Hadoop cluster has a single NameNode
- The Secondary NameNode is not a fail-over NameNode
- The NameNode is a single point of failure (SPOF)

Secondary NameNode (master)
- The Secondary NameNode (2NN) is not a fail-over NameNode!
- It performs memory-intensive administrative functions for the NameNode.
- The Secondary NameNode periodically combines a prior file system snapshot and the edit log into a new snapshot
- The new snapshot is transmitted back to the NameNode
- The Secondary NameNode should run on a separate machine in a large installation
- It requires as much RAM as the NameNode
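The checkpoint described above, where a prior snapshot and the edit log are combined into a new snapshot, can be pictured with a small Python sketch. This is a toy model only: the operation names ("create", "delete", "rename") and the dictionary layout are assumptions made for illustration and do not reflect the real fsimage or edits file formats.

```python
# Toy model of an HDFS-style checkpoint: replay an edit log onto a snapshot.
# The op names and the dict layout are illustrative assumptions, not the
# real on-disk fsimage/edits formats.

def apply_edits(fsimage, edits):
    """Return a new snapshot with every logged change applied in order."""
    snapshot = dict(fsimage)  # copy, so the prior snapshot stays untouched
    for op, *args in edits:
        if op == "create":
            path, replicas = args
            snapshot[path] = {"replicas": replicas}
        elif op == "delete":
            (path,) = args
            snapshot.pop(path, None)
        elif op == "rename":
            src, dst = args
            snapshot[dst] = snapshot.pop(src)
        else:
            raise ValueError(f"unknown edit op: {op}")
    return snapshot


# Prior snapshot plus the changes logged since it was taken.
fsimage = {"/home/foo/data": {"replicas": 3}}
edits = [
    ("create", "/home/foo/new", 3),
    ("rename", "/home/foo/data", "/home/foo/old"),
    ("delete", "/home/foo/new"),
]

new_fsimage = apply_edits(fsimage, edits)
print(sorted(new_fsimage))
```

After a merge like this the edit log can start again from empty, which is why the work is worth moving off the NameNode onto a separate machine.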

DataNode
- DataNode (slave)
- JobTracker (master): exactly one per cluster
- TaskTracker (slave): one or more per cluster

Running Jobs
- A client submits a job to the JobTracker
- JobTracker assigns a job ID
- Client calculates the input splits for the job
- Client adds job code and configuration to HDFS
- The JobTracker creates a Map task for each input split
- TaskTrackers send periodic heartbeats to the JobTracker
- These heartbeats also signal readiness to run tasks
- JobTracker then assigns tasks to these TaskTrackers

Running Jobs
- The TaskTracker then forks a new JVM to run the task
- This isolates the TaskTracker from bugs or faulty code
- A single instance of task execution is called a task attempt
- Status info is periodically sent back to the JobTracker
- Each block is stored on multiple different nodes for redundancy
- Default is three replicas
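The relationship between input splits and Map tasks can be sketched in a few lines of Python. This is an in-process toy under stated assumptions: the function names (make_splits, map_task, run_job) and the word-count job are hypothetical, there is no JobTracker, TaskTracker, heartbeat or JVM here, and the reduce step is included only to complete the example.

```python
from collections import defaultdict

# In-process toy: one map task per input split, then a summing reduce.
# All names here are hypothetical; nothing is scheduled on a real cluster.

def make_splits(records, split_size):
    """Client side: cut the input into fixed-size splits."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def map_task(split):
    """One map task per split: emit (word, 1) for every word in the split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def run_job(records, split_size):
    splits = make_splits(records, split_size)
    map_outputs = [map_task(split) for split in splits]  # one task per split
    grouped = defaultdict(list)
    for output in map_outputs:          # group intermediate values by key
        for key, value in output:
            grouped[key].append(value)
    counts = {key: sum(values) for key, values in grouped.items()}
    return len(splits), counts


records = ["big data", "data lake", "big iron", "hadoop data", "data"]
num_map_tasks, counts = run_job(records, split_size=2)
print(num_map_tasks, counts)
```

Five input records with a split size of two give three splits, and therefore three map tasks.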

Anatomy of a File Write
1. Client connects to the NameNode
2. NameNode places an entry for the file in its metadata, returns the block name and list of DataNodes to the client
3. Client connects to the first DataNode and starts sending data
4. As data is received by the first DataNode, it connects to the second and starts sending data
5. Second DataNode similarly connects to the third
6. Ack packets from the pipeline are sent back to the client
7. Client reports to the NameNode when the block is written
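The seven steps above can be walked through with a small in-memory Python sketch. Everything in it is an assumption made for illustration: ToyNameNode, write_block and the node names dn1 to dn4 are hypothetical, the placement simply takes the first three DataNodes (real HDFS placement is rack-aware), and whole blocks are passed along instead of streamed packets.

```python
# Toy, in-memory walk-through of the write steps listed above.
# ToyNameNode, write_block and the dn* names are hypothetical.

class ToyNameNode:
    def __init__(self, datanode_names, replication=3):
        self.datanode_names = list(datanode_names)
        self.replication = replication
        self.metadata = {}       # path -> list of (block_name, pipeline)
        self.completed = set()   # blocks the client has reported as written
        self._next_id = 0

    def add_block(self, path):
        """Steps 1-2: record the file entry, return block name and DataNodes."""
        block_name = f"blk_{self._next_id}"
        self._next_id += 1
        pipeline = self.datanode_names[: self.replication]
        self.metadata.setdefault(path, []).append((block_name, pipeline))
        return block_name, pipeline

    def block_written(self, block_name):
        """Step 7: the client reports that the block is written."""
        self.completed.add(block_name)


def write_block(block_name, data, pipeline, storage):
    """Steps 3-6: store on the first node, forward downstream, collect acks."""
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    storage[head][block_name] = data                     # this node stores it
    acks = write_block(block_name, data, rest, storage)  # forward to the next
    return acks + [head]  # acks travel back up the pipeline to the client


namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
storage = {name: {} for name in namenode.datanode_names}

block_name, pipeline = namenode.add_block("/home/foo/data")
acks = write_block(block_name, b"hello hdfs", pipeline, storage)
if acks == list(reversed(pipeline)):  # every node in the pipeline acked
    namenode.block_written(block_name)

print(block_name, pipeline, acks)
```

The client only ever talks to the first DataNode; each node forwards to the next, so replication does not multiply the client's outbound traffic.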

Hadoop Core Operations Review
[Diagram repeated from "How Hadoop Works at Its Core": client, NameNode metadata ops, block reads and writes, and replication between DataNodes in Rack 1 and Rack 2.]

Expanding on Core Hadoop

Hive, HBase and Sqoop
- Hive: High level abstraction on top of MapReduce. Allows users to query data using HiveQL, a language very similar to standard SQL.
- HBase: A distributed, sparse, column oriented data store.
- Sqoop: The Hadoop ingestion engine, the basis of connectors like Teradata, Informatica, DB2 and many others.

Oozie
- Workflow scheduler system to manage Apache Hadoop jobs
- Workflow jobs are Directed Acyclic Graphs (DAGs) of actions
- Coordinator jobs are recurrent Workflow jobs triggered by time (frequency) and data availability
- Integrated with the rest of the Hadoop stack
- Supports several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp)
- Also supports system specific jobs (such as Java programs and shell scripts)
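To make the DAG idea concrete, here is a minimal Python sketch that orders a set of dependent actions with the standard library's graphlib module (Python 3.9 or later). It is not Oozie's workflow format or API: the action names and their dependencies are hypothetical, and a real Oozie workflow is defined in XML.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it waits for.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_report": {"pig_clean"},
    "distcp_archive": {"pig_clean"},
    "notify": {"hive_report", "distcp_archive"},
}

# A valid execution order exists only because the graph has no cycles.
order = list(TopologicalSorter(workflow).static_order())
print(order)

# Adding a dependency loop makes the graph cyclic, so no order exists.
cyclic = dict(workflow, sqoop_import={"notify"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

The two middle actions, hive_report and distcp_archive, do not depend on each other, so a scheduler is free to run them in either order or in parallel.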

Flume
- Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
- Simple and flexible architecture based on streaming data flows
- Robust and fault tolerant with tunable reliability mechanisms and many fail-over and recovery mechanisms
- Uses a simple extensible data model that allows for online analytic application

Mahout
- The Apache Mahout machine learning library's goal is to build scalable machine learning libraries
- Currently Mahout supports mainly three use cases:
  - Recommendation mining takes users' behavior and from that tries to find items users might like
  - Clustering, for example, takes text documents and groups them into groups of topically related documents
  - Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabeled documents to the (hopefully) correct category

Whirr
- Set of libraries for launching Hadoop instances on clouds
- A cloud-neutral way to run services: you don't have to worry about the idiosyncrasies of each provider.
- A common service API: the details of provisioning are particular to the service.
- Smart defaults for services: you can get a properly configured system running quickly, while still being able to override settings as needed.

Giraph
- Iterative graph processing system built for high scalability
- Currently used at Facebook to analyze the social graph formed by users and their connections

Apache Pig
- Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
- Language layer currently consists of a textual language called Pig Latin, which has the following key properties:
  - Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  - Extensibility. Users can create their own functions to do special-purpose processing.

Ambari
- Project goal is to develop software that simplifies Hadoop cluster management:
  - Provisioning a Hadoop Cluster
  - Managing a Hadoop Cluster
  - Monitoring a Hadoop Cluster
- Ambari leverages well known technology like Ganglia and Nagios under the covers.
- Provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs

HUE: Hadoop User Experience
- Graphical front end to Hadoop tools for launching, editing and monitoring jobs
- Provides shortcuts to various command line shells for working directly with components
- Can be integrated with authentication services like Kerberos or Active Directory

R Statistical Language
- Statistical Language
- Open Source Licensed
- Similar to Octave or MATLAB
- Not currently packaged for SLES or openSUSE

Shark/Spark
- Spark is a real time query framework developed at the Berkeley AMPLab.
- Spark was initially developed for two applications where placing data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining.
- Shark uses Spark to process real time queries in Hive.
- Up to 100x faster than MapReduce in some cases.
- Going into most Hadoop distros now or soon.

ZooKeeper
- An orchestration stack.
- Centralized service for:
  - Maintaining configuration information
  - Naming
  - Providing distributed synchronization
  - Delivering group services

NoSQL: Cassandra
- Enterprise provider is DataStax
- Keyspace -> container for column families
- High Performance, Highly Scalable, Available - No SPOF
- Replication by hashing data between nodes
- Query by Column - Requires index
- SQL-Like
- Native support for Apache Hadoop
- Flexible Schema -> Change at runtime
- No transactions, no JOINs
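"Replication by hashing data between nodes" can be illustrated with a generic hash ring in Python. This is a sketch of the general technique only, under stated assumptions: it uses MD5 from the standard library and one token per node, the node names are hypothetical, and it is not Cassandra's actual partitioner or replication strategy.

```python
import hashlib
from bisect import bisect_right

# Generic hash-ring sketch; not Cassandra's real partitioner.

def token(value):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(key, nodes, replication_factor=3):
    """Pick the first node clockwise from the key's token, plus its successors."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, token(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical names
owners = replicas_for("user:alejandro", nodes)
print(owners)
```

Because every client computes the same owners from the key alone, no central lookup service is needed, which fits the "No SPOF" point above.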

NoSQL (cont): Accumulo
- Like HBase, a BigTable clone.
- Join-less.
- Runs on top of Hadoop. MapReduce with Hadoop.
- Used for scanning large two-dimensional tables.
- Accumulo, HBase and Cassandra are part of the Hadoop ecosystem.
- HBase supported by the Hadoop provider.
- Hugely scalable NoSQL database developed at NSA.
- Only NoSQL DB with cell level locking and security.

NoSQL (cont): MongoDB
- Enterprise provider is MongoDB Inc., formerly known as 10gen
- Non-relational data store for JSON documents:
  {"name": "alejandro"}
  {"name": "alejandro", "Age": 31, "likes": ["soccer", "golf", "Beach"]}
- Schemaless: container vs table, document vs row
- Does not support JOINs or transactions (across multiple documents).
- Does not perform like memcached and is not as functional as an RDBMS. Sits in the middle.

NoSQL (cont): MongoDB
- Provides the "mongo" shell (a JavaScript interpreter), tools and drivers for easy access to the API.
- Supports replication and sharding.
- Supports an aggregation framework, MapReduce and a Hadoop plugin.
- Maximum document size is 16 MB -> use GridFS to store big data + metadata.
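The schemaless document model and the 16 MB limit can be pictured with plain Python dictionaries. This is a rough sketch, not MongoDB driver code: the helper names are hypothetical, and the size of the JSON encoding is used as a stand-in for the size of the BSON encoding that MongoDB actually measures, so the numbers are only approximate.

```python
import json

# Rough illustration only: JSON byte length approximates, but does not
# equal, the BSON size that MongoDB checks against its 16 MB limit.
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024


def approx_size(document):
    """Approximate a document's size by its UTF-8 encoded JSON length."""
    return len(json.dumps(document).encode("utf-8"))


def needs_gridfs(document):
    """True when the document is too large to store as a single document."""
    return approx_size(document) > MAX_DOCUMENT_BYTES


# Schemaless: documents in the same container need not share fields.
people = [
    {"name": "alejandro"},
    {"name": "alejandro", "Age": 31, "likes": ["soccer", "golf", "Beach"]},
]
big_file = {"name": "video", "payload": "x" * (17 * 1024 * 1024)}

print([needs_gridfs(doc) for doc in people], needs_gridfs(big_file))
```

Anything over the limit would be split into chunks with separate metadata, which is the role GridFS plays.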

Web UI Ports for Users
Daemon                  Default Port  Configuration parameter
NameNode                50070         dfs.http.address
DataNode                50075         dfs.datanode.http.address
Secondary NameNode      50090         dfs.secondary.http.address
Backup/Checkpoint Node  50105         dfs.backup.http.address
JobTracker              50030         mapred.job.tracker.http.address
TaskTracker             50060         mapred.task.tracker.http.address
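As a convenience, the defaults in the table above can be turned into URLs with a short Python helper. This is only a sketch: web_ui_url and the example host names are hypothetical, plain HTTP is assumed, and a cluster that overrides the configuration parameters will listen on different ports.

```python
# Default web UI ports copied from the table above.
WEB_UI_PORTS = {
    "NameNode": 50070,
    "DataNode": 50075,
    "Secondary NameNode": 50090,
    "Backup/Checkpoint Node": 50105,
    "JobTracker": 50030,
    "TaskTracker": 50060,
}


def web_ui_url(daemon, host):
    """Build the default web UI URL for a daemon running on the given host."""
    try:
        port = WEB_UI_PORTS[daemon]
    except KeyError:
        raise ValueError(f"no default web UI port known for {daemon!r}") from None
    return f"http://{host}:{port}/"


# Host names below are placeholders, not real machines.
namenode_url = web_ui_url("NameNode", "nn.example.com")
jobtracker_url = web_ui_url("JobTracker", "jt.example.com")
print(namenode_url, jobtracker_url)
```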

http://bigdatauniversity.com/
https://ccp.cloudera.com/display/doc/documentation
http://thecloudtutorial.com/hadoop-tutorial.html
http://www.saphana.com/community/learn
http://developer.yahoo.com/hadoop/tutorial/
http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/

Resources
- SUSE Big Data website: https://www.suse.com/solutions/platform.html#big_data
- SUSE Big Data Flyer: http://www.novell.com/docrep/2013/03/suse_linux_enterprise_foundation_for_big_data_solution.pdf
SUSE Big Data Contacts
- Business: Frank Rego, frego@suse.com
- Technical: Peter Linnell, plinnell@suse.com


Unpublished Work of SUSE. All Rights Reserved. This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability. General Disclaimer This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.