Data-Intensive Programming
Timo Aaltonen, Department of Pervasive Computing


Data-Intensive Programming
- Lecturer: Timo Aaltonen, University Lecturer, timo.aaltonen@tut.fi
- Assistants: Henri Terho and Antti Kallonen (helping with technicalities, answering questions)
- Course work
- Exam!

Communication
- Communication between students and personnel is carried out with the Slack team collaboration tool
- All students will get an e-mail invitation
- Slack is the main channel for asking questions

Exam
- The popularity of the course surprised us; for the course work we are on plan C
- Grading 150 students with our current plan does not work
- Therefore, the course will have an exam. Sorry!

Course Work
- Hands-on course
- Course work in groups of three students
- Java and Python (maybe Scala)
- Hadoop, HDFS, MapReduce, Spark, Hue
- Traffic data by the Finnish Transport Agency
- Report

Course Work
- Cloudera CDH 5 virtual machine
- Each group should install it on their own laptop (in VirtualBox, for instance)
- Data preparation, import, analysis, dissemination
- The tasks will be published during the course
- This week's task: form a group of three

Today
- Big data
- Data science
- Hadoop
- HDFS

Big Data
- The world is drowning in data
- Clickstream data is collected by web servers
- The NYSE generates 1 TB of trade data every day
- MTC collects 5000 attributes for each call
- Smart marketers collect purchasing habits
- More data usually beats better algorithms

Three Vs of Big Data
- Volume: the amount of data. Transaction data stored through the years, unstructured data streaming in from social media, increasing amounts of sensor and machine-to-machine data
- Velocity: the speed of data in and out, e.g. streaming data from RFID and sensors
- Variety: the range of data types and sources, both structured and unstructured

Big Data
- Variability: data flows can be highly inconsistent, with periodic peaks
- Complexity: data comes from multiple sources; linking, matching, cleansing and transforming data across systems is a complex task

Data Science
- Definition: data science is the activity of extracting insights from messy data
- Facebook analyzes location data to identify global migration patterns and to find the fan bases of different sports teams
- A retailer might track purchases both online and in-store to target marketing

Data Science

New Challenges
- Compute-intensiveness: raw computing power
- Challenges of data-intensiveness: the amount of data, the complexity of data, and the speed at which data is changing

Data Storage Analysis
- A hard drive from 1990: stores 1,370 MB, reads at 4.4 MB/s, so the entire drive can be read in about five minutes
- A hard drive from the 2010s: stores 1 TB, reads at 100 MB/s, so reading the entire drive takes more than two and a half hours
- Storage capacity has grown far faster than transfer speed

Scalability
- The system grows without requiring developers to re-architect their algorithms or applications
- Horizontal scaling: add more machines to the cluster
- Vertical scaling: add more resources (CPU, RAM, disk) to a single machine

Parallel Approach
- Read from multiple disks in parallel: 100 drives, each holding 1/100 of the data, give 1/100 of the reading time
- Problem: hardware failures; solved with replication
- Problem: most analysis tasks need to combine data in some way; solved with MapReduce
- Hadoop provides both

Hadoop
- Hadoop is a framework of tools, libraries and methodologies
- Operates on large unstructured datasets
- Open source (Apache License)
- Simple programming model
- Scalable

Hadoop
- A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license)
- Core Hadoop has two main systems:
  - Hadoop Distributed File System: self-healing, high-bandwidth clustered storage
  - MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction

Hadoop
- Administrators: installation, monitoring/managing systems, tuning systems
- End users: designing MapReduce applications, importing and exporting data, working with various Hadoop tools

Hadoop
- Developed by Doug Cutting and Michael J. Cafarella
- Based on Google's MapReduce technology
- Designed to handle large amounts of data and to be robust
- Donated to the Apache Software Foundation in 2006 by Yahoo

Hadoop Design Principles
- Moving computation is cheaper than moving data
- Hardware will fail; manage it
- Hide execution details from the user
- Use streaming data access
- Use a simple file system coherency model
- Hadoop is not a replacement for SQL, is not always fast and efficient, and is not meant for quick ad hoc querying

Hadoop MapReduce
- Co-locate data with the compute node: data access is fast since it is local (data locality)
- Network bandwidth is the most precious resource in the data center; MR implementations explicitly model the network topology
- MR operates at a high level of abstraction: the programmer thinks in terms of functions over key and value pairs (see the sketch below)
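
To make the key/value abstraction concrete, below is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce API); the class names follow the conventional Hadoop documentation example and are not specific to this course.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: for every word on an input line, emit the pair (word, 1)
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(line.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);      // e.g. ("hadoop", 1)
                }
            }
        }

        // Reduce: the framework groups the pairs by word, so the reducer
        // receives (word, [1, 1, ...]) and emits (word, total count)
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }

        // Driver: configure the job and let the framework do the rest
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The framework splits the input, shuffles the intermediate (word, 1) pairs to the reducers and reschedules failed tasks; the programmer writes only the two functions and the driver.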

Hadoop MapReduce
- MR is a shared-nothing architecture: tasks do not depend on each other, so failed tasks can be rescheduled by the system
- Invented by Google, where it was used for producing search indexes; applicable to many other problems too

Hadoop Ecosystem
- Common: a set of components and interfaces for distributed file systems and general I/O
- Avro: a serialization system for efficient, cross-language RPC and persistent storage
- MapReduce: a distributed data processing model and execution environment

Hadoop Ecosystem
- HDFS: a distributed filesystem
- Pig, Hive
- HBase, ZooKeeper, Sqoop, Oozie

RDBMS vs HDFS
- Schema-on-write (RDBMS):
  - The schema must be created before any data can be loaded
  - An explicit load operation transforms the data into the database's internal structure
  - New columns must be added explicitly before data for such columns can be loaded into the database
- Schema-on-read (HDFS):
  - Data is simply copied to the file store; no transformation is needed
  - A SerDe (serializer/deserializer) is applied at read time to extract the required columns (late binding)
  - New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it

Flexibility: Complex Data Processing
1. Java MapReduce: the most flexibility and performance, but a tedious development cycle (the assembly language of Hadoop)
2. Streaming MapReduce (and the related Pipes interface): allows you to develop in any programming language of your choice, with slightly lower performance and less flexibility than native Java MapReduce
3. Crunch: a library for multi-stage MapReduce pipelines in Java (modeled after Google's FlumeJava)
4. Pig Latin: a high-level language out of Yahoo, suitable for batch data flow workloads
5. Hive: a SQL interpreter out of Facebook; also includes a metastore mapping files to their schemas and associated SerDes
6. Oozie: an XML-based (hPDL) workflow engine that enables creating a workflow of jobs composed of any of the above

Hadoop Distributed File System
- Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System)
- Based on Google's GFS (Google File System)
- HDFS provides redundant storage for massive amounts of data using commodity hardware
- Data in HDFS is distributed across all data nodes, which makes MapReduce processing efficient

HDFS Design
- A file system on commodity hardware that survives even high failure rates of its components
- Supports lots of large files: file sizes of hundreds of gigabytes or several terabytes
- Main design principles:
  - Write once, read many times
  - Streaming reads rather than frequent random access
  - High throughput is more important than low latency

HDFS Architecture
- HDFS operates on top of an existing file system
- Files are stored as blocks (default size 64 MB; not the same as file system blocks)
- File reliability is based on block-level replication: each block of a file is typically replicated across several DataNodes (the default replication factor is 3; see the block-location sketch below)
- The NameNode stores metadata, manages replication and provides access to files
- No data caching (because of the large datasets); data is read/streamed directly from a DataNode to the client
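
As a small illustration of blocks and replication, the HDFS Java API can list where each block of a file (and its replicas) physically lives. A minimal sketch, assuming a reachable cluster; the file path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // reads core-site.xml etc.
            FileSystem fs = FileSystem.get(conf);
            // Hypothetical file; replace with a real HDFS path
            FileStatus status = fs.getFileStatus(new Path("/data/example.log"));
            // One BlockLocation per block, listing the DataNodes holding a replica
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                    + ", hosts " + String.join(",", block.getHosts()));
            }
        }
    }

With the default replication factor, each printed block should list three hosts.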

HDFS Architecture
- The NameNode stores HDFS metadata: filenames, locations of blocks, file attributes
- Metadata is kept in RAM for fast lookups; as a rule of thumb, each file, directory and block consumes on the order of 150 bytes of NameNode memory
- Therefore the number of files in HDFS is limited by the amount of RAM available on the NameNode
- HDFS NameNode federation can help with RAM issues: several NameNodes, each of which manages a portion of the file system namespace

HDFS Architecture
- A DataNode stores file contents as blocks
- Different blocks of the same file are typically stored on different DataNodes
- The same block is typically replicated across several DataNodes for redundancy
- Each DataNode periodically sends a report of all its blocks to the NameNode
- DataNodes exchange heartbeats with the NameNode

HDFS Architecture
- Built-in protection against DataNode failure
- If the NameNode does not receive a heartbeat from a DataNode within a certain time period, the DataNode is assumed to be lost
- When a DataNode fails, block replication is actively maintained:
  - The NameNode determines which blocks were on the lost DataNode
  - The NameNode finds the remaining copies of these blocks and replicates them to other nodes

High-Availability (HA) Issues: NameNode Failure
- NameNode failure corresponds to losing all files on the file system, the equivalent of:
  % sudo rm --dont-do-this /
- For recovery, Hadoop provides two options:
  - Backing up the files that make up the persistent state of the file system
  - The secondary NameNode
- Some more advanced techniques also exist

HA Issues: the Secondary NameNode
- The secondary NameNode is not a mirrored NameNode; it performs memory-intensive administrative functions
- The NameNode keeps metadata in memory and writes changes to an edit log
- The secondary NameNode periodically combines the previous namespace image and the edit log into a new namespace image, preventing the log from growing too large
- It keeps a copy of the merged namespace image, which can be used in the event of NameNode failure
- Recommended to run on a separate machine: it requires as much RAM as the primary NameNode

Network Topology
- HDFS is aware of how close two nodes are in the network
- Distances, from closest to furthest:
  - 0: processes on the same node
  - 2: different nodes in the same rack
  - 4: nodes in different racks in the same data center
  - 6: nodes in different data centers

File Block Placement
- Clients always read from the closest node
- Default placement strategy:
  - The first replica on the same node as the client
  - The second replica on a node in a different rack
  - The third replica on a different, randomly selected node in the same rack as the second replica
  - Additional replicas (beyond three) are placed randomly

Balancing
- Hadoop works best when blocks are evenly spread out
- DataNodes of different sizes are supported; in the optimal case the disk usage percentage is at approximately the same level on all DataNodes
- Hadoop provides a balancer daemon that re-distributes blocks; it should be run when new DataNodes are added

Accessing Data
- Data can be accessed using various methods:
  - Java API (demo; a sketch follows below)
  - C API
  - Command line / POSIX (FUSE mount)
  - Command line / HDFS client (demo)
  - HTTP
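
The lecture demo itself is not reproduced here, but a minimal sketch of reading a file through the HDFS Java API looks roughly like this; the URI is an illustrative example, not a path from the course:

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            String uri = "hdfs://localhost:8020/user/timo/sample.txt"; // example
            FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
            InputStream in = null;
            try {
                in = fs.open(new Path(uri));                    // HDFS input stream
                IOUtils.copyBytes(in, System.out, 4096, false); // stream to stdout
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }

The command-line equivalent with the HDFS client is: hdfs dfs -cat /user/timo/sample.txt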

HDFS URI
- All HDFS command-line commands take path URIs as arguments
- URI format: scheme://authority/path, for example hdfs://namenode:8020/user/timo/data.txt (the host and path here are illustrative)
- The scheme and authority are optional; if they are not specified, the defaults set in the configuration file are used (see the sketch below)
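
The same defaulting is visible in the Java API: a path without a scheme and authority is resolved against the fs.defaultFS configuration property. A small sketch with illustrative host and path names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UriDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // example authority
            FileSystem fs = FileSystem.get(conf);
            // A fully qualified URI and a bare path resolve to the same file
            Path full = new Path("hdfs://namenode:8020/user/timo/data.txt");
            Path bare = new Path("/user/timo/data.txt");
            System.out.println(fs.makeQualified(bare));        // prints the full URI
            System.out.println(full.equals(fs.makeQualified(bare))); // true
        }
    }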

Conclusions
- Pros:
  - Support for very large files
  - Designed for streaming data
  - Runs on commodity hardware
- Cons:
  - Not designed for low-latency data access
  - The architecture does not support lots of small files
  - No support for multiple writers or arbitrary file modifications (writes always go to the end of the file)