Overview: Motivation, MapReduce/Hadoop in a Nutshell, Experimental Cluster Hardware Example, Application Areas at the Austrian National Library




Overview
Motivation
MapReduce/Hadoop in a nutshell
Experimental cluster hardware example
Application areas at the Austrian National Library: Web Archiving, Austrian Books Online
SCAPE at the Austrian National Library: hardware set-up, open source software architecture, application scenarios
This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).

Cultural Heritage and Big Data?
Google Books Project (www.books.google.com): by 2012, 20 million books scanned (approx. 7,000,000,000 pages)
Europeana (www.europeana.eu/portal): by 2012, 25 million digital objects, all metadata licensed CC0

Cultural Heritage and Big Data?
HathiTrust (www.hathitrust.org): 3,721,702,950 scanned pages, 477 TB
Internet Archive (www.archive.org): 245 billion web pages archived, 10 PB

What can we expect?
ENUMERATE survey 2012: only about 4% digitised so far
Strong growth of born-digital information
[Figures omitted; sources: security.networksasia.net, www.idc.com]

MapReduce/Hadoop in a nutshell
[Diagram: the input data is split across parallel tasks (Task 1, Task 2, Task 3); the per-task results are aggregated into the output data]
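To make the model concrete, here is the classic word-count job in Java, a minimal sketch rather than code from the presentation: each map task emits a (word, 1) pair for every word in its input split, and the reduce tasks aggregate those pairs into one count per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each task emits (word, 1) for every word in its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: all counts for the same word arrive at one reducer
  // and are aggregated into a single result.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    // args: <input dir> <output dir>
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```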

Experimental Cluster
Controller node (Job Tracker, Name Node):
CPU: 2 x 2.40 GHz quad-core CPUs (16 HyperThreading cores)
RAM: 24 GB
Disk: 3 x 1 TB disks configured as RAID5 (redundancy), 2 TB effective
Data nodes (Task Trackers):
CPU: 1 x 2.53 GHz quad-core CPU (8 HyperThreading cores)
RAM: 16 GB
Disk: 2 x 1 TB disks configured as RAID0 (performance), 2 TB effective
Of the 8 HyperThreading cores on each data node, 5 are reserved for Map tasks, 2 for Reduce tasks, and 1 for the operating system; across the five data nodes this yields 25 processing cores for Map tasks and 10 cores for Reduce tasks.
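In Hadoop 1.x (the Job Tracker/Task Tracker generation used here), such a per-node split is enforced through the task slot limits in mapred-site.xml. A sketch using the stock Hadoop 1.x property names; the values simply mirror the 5/2 split described above and are not taken from the slides:

```xml
<!-- mapred-site.xml on each data node (Hadoop 1.x).
     Capping the slots at 5 map + 2 reduce tasks leaves one
     HyperThreading core free for the operating system. -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>5</value>  <!-- concurrent map tasks per node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>  <!-- concurrent reduce tasks per node -->
  </property>
</configuration>
```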

Platform Architecture
Access via REST API
Taverna workflow engine for complex jobs
Hive as the frontend for analytic queries
MapReduce/Pig for Extract, Transform, and Load (ETL)
Small objects stored in HDFS or HBase
Large digital objects stored on a NetApp filer

Application Scenarios
Web Archiving:
Scenario 1: Web Archive MIME Type Identification
Austrian Books Online:
Scenario 2: Image File Format Migration
Scenario 3: Comparison of Book Derivatives
Scenario 4: MapReduce in Digitised Book Quality Assurance

Key Data: Web Archiving
Domain harvesting: the entire .at top-level domain, every 2 years
Selective harvesting: important websites that change regularly
Event harvesting: special occasions and events (e.g. elections)
Physical storage: 19 TB
Raw data: 32 TB
Number of objects: 1,241,650,566

Scenario 1: Web Archive MIME Type Identification
[Diagram: a (W)ARC container holding JPG, GIF, HTM, and MID records is read by a custom (W)ARC InputFormat and RecordReader, based on the HERITRIX web crawler's (W)ARC read/write code; the Map phase detects each record's MIME type with Apache Tika and emits (MIME type, 1) pairs, and the Reduce phase sums them, e.g. image/jpg 2, image/gif 1, text/html 2, audio/midi 1]
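The map side of this job reduces to one Tika call per archive record. A sketch under stated assumptions: the slides do not spell out the (W)ARC RecordReader's types, so this mapper assumes it delivers (record URL, payload bytes) pairs; Tika.detect(byte[]) is the real Apache Tika API. The summing reducer from the word-count sketch above aggregates the output.

```java
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.tika.Tika;

// Map phase of the MIME identification job: one (mimeType, 1) pair
// per (W)ARC record read from the archive container.
public class MimeTypeMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Tika tika = new Tika();  // Tika facade for content-type detection

  @Override
  protected void map(Text recordUrl, BytesWritable payload, Context context)
      throws IOException, InterruptedException {
    // detect() sniffs the MIME type from the record's leading bytes.
    String mimeType = tika.detect(payload.copyBytes());
    context.write(new Text(mimeType), ONE);
  }
}
```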

Scenario 1: Web Archive MIME Type Identification
[Chart comparing the identification tools DROID 6.01 and TIKA 1.0]

Key Data: Austrian Books Online
Public-private partnership with Google
Only public domain works
Objective: scan ~600,000 volumes (~200 million pages)
~70 project team members, 20+ in the core team
~130,000 physical volumes (~40 million pages) scanned so far

ADOCO (Austrian Books Online Download & Control)
[Diagram: content flows from the Google public-private partnership through ADOCO into local storage; Pairtree reference: https://confluence.ucop.edu/display/curation/pairtree]

Scenario 2: Image File Format Migration
Task: image file format migration
TIFF to JPEG2000 migration. Objective: reduce storage costs by reducing the size of the images.
JPEG2000 to TIFF migration. Objective: mitigate the risk of JPEG2000 file format obsolescence.
Challenges:
Integrating validation, migration, and quality assurance
Compute-intensive quality assurance

Scenario 3: Comparison of Book Derivatives
Task: compare different versions of the same book
Images have been manipulated (cropped, rotated) and stored in different locations
Images come from different scanning sources or were subject to different modification procedures
Challenges:
Compute-intensive (average runtime per book on a single quad-core server: ~4.5 hours)
130,000 books, ~320 pages each
SCAPE tool: Matchbox

Scenario 4: MapReduce in Quality Assurance
ETL processing of 60,000 books (~24 million pages)
Using Taverna's Tool service (remote ssh execution)
Orchestration of different types of Hadoop jobs: Hadoop Streaming API, Hadoop MapReduce, Hive
Workflow available on myExperiment: http://www.myexperiment.org/workflows/3105
See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-bigdata-processing-chaining-hadoop-jobs-using-taverna

Scenario 4: MapReduce in Quality Assurance
The workflow chains six steps, detailed on the following slides:
1. Create input text files containing the file paths (JP2 & HTML)
2. Read image metadata using ExifTool (Hadoop Streaming API)
3. Create a sequence file containing all HTML files
4. Calculate the average block width using MapReduce
5. Load the data into Hive tables
6. Execute the SQL test query

Reading image metadata (Jp2PathCreator, HadoopStreamingExiftoolRead)
A find over the NAS produces a ~1.4 GB text file listing one JP2 path per line (e.g. /NAS/Z119585409/00000001.jp2); a Hadoop Streaming job then runs ExifTool on each file and emits one (page ID, image width) pair per page (e.g. Z119585409/00000001 2345), ~1.2 GB of output.
60,000 books (24 million pages): ~5 h + ~38 h = ~43 h

SequenceFile creation (HtmlPathCreator, SequenceFileCreator)
A find over the NAS produces a ~1.4 GB text file listing one HTML path per line (e.g. /NAS/Z119585409/00000707.html); the SequenceFileCreator job then packs all HTML files into a single SequenceFile keyed by page ID (e.g. Z119585409/00000707), ~997 GB uncompressed.
60,000 books (24 million pages): ~5 h + ~24 h = ~29 h
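The slides do not show the writer itself; the following is a minimal sketch of what packing small files into a SequenceFile can look like, assuming Text keys (the page ID derived from the path) and BytesWritable values holding the raw HTML. Packing avoids Hadoop's small-files problem: map tasks read one large, splittable file instead of millions of tiny ones.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs many small HTML files into one SequenceFile keyed by page ID.
public class SequenceFileCreator {
  public static void main(String[] args) throws Exception {
    // args: <output SequenceFile on the default filesystem> <path list from the find step>
    Configuration conf = new Configuration();
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path(args[0])),
        SequenceFile.Writer.keyClass(Text.class),             // page ID, e.g. Z119585409/00000707
        SequenceFile.Writer.valueClass(BytesWritable.class)); // raw HTML bytes
    try {
      // The path list holds one file per line, e.g. /NAS/Z119585409/00000707.html
      for (String line : Files.readAllLines(Paths.get(args[1]))) {
        byte[] html = Files.readAllBytes(Paths.get(line));
        // Derive the page ID key by stripping the NAS prefix and the extension.
        String key = line.replaceFirst("^/NAS/", "").replaceFirst("\\.html$", "");
        writer.append(new Text(key), new BytesWritable(html));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
```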

Calculate average block width using MapReduce (HadoopAvBlockWidthMapReduce)
The Map phase reads each page's HTML (hOCR) from the SequenceFile and emits one (page ID, block width) pair per text block (e.g. Z119585409/00000001 with 2100, 2200, 2300, 2400); the Reduce phase averages the widths per page and writes one (page ID, average width) line to a text file (e.g. Z119585409/00000001 2250).
60,000 books (24 million pages): ~6 h
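The reduce side is a small averaging step. A sketch (class and field names are mine; the HTML parsing in the map phase is omitted) that matches the example numbers above, since (2100 + 2200 + 2300 + 2400) / 4 = 2250:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce phase of the average-block-width job: the map phase emits one
// (page ID, block width) pair per text block; this reducer averages the
// widths so exactly one (page ID, average width) line is written per page.
public class AverageBlockWidthReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text pageId, Iterable<IntWritable> widths, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    int count = 0;
    for (IntWritable w : widths) {
      sum += w.get();
      count++;
    }
    if (count > 0) {
      context.write(pageId, new IntWritable((int) (sum / count)));
    }
  }
}
```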

Analytic Queries: HiveLoadExifData & HiveLoadHocrData
The two result sets are loaded into Hive tables:
CREATE TABLE jp2width (jid STRING, jwidth INT)
CREATE TABLE htmlwidth (hid STRING, hwidth INT)
jp2width (from the ExifTool output):
jid                  jwidth
Z119585409/00000001  2250
Z119585409/00000002  2150
Z119585409/00000003  2125
Z119585409/00000004  2125
Z119585409/00000005  2250
htmlwidth (from the average block width output):
hid                  hwidth
Z119585409/00000001  1870
Z119585409/00000002  2100
Z119585409/00000003  2015
Z119585409/00000004  1350
Z119585409/00000005  1700

Analytic Queries: HiveSelect
The test query joins the two tables on the page ID:
select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
jid                  jwidth  hwidth
Z119585409/00000001  2250    1870
Z119585409/00000002  2150    2100
Z119585409/00000003  2125    2015
Z119585409/00000004  2125    1350
Z119585409/00000005  2250    1700
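Such a test query is typically issued from the Hive CLI; for completeness, a minimal sketch of running the same statement programmatically over JDBC. The HiveServer2 setup, host, and port are assumptions, not details from the slides; only the query string is taken from the slide above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Runs the width-comparison test query against Hive over JDBC (HiveServer2).
public class HiveSelect {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid")) {
      while (rs.next()) {
        // One line per page: ID, JP2 image width, average HTML block width.
        System.out.printf("%s\t%d\t%d%n", rs.getString(1), rs.getInt(2), rs.getInt(3));
      }
    }
  }
}
```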

Further Information
Project website: www.scape-project.eu
GitHub repository: www.github.com/openplanets
Project wiki: www.wiki.opf-labs.org/display/sp/home
SCAPE tools mentioned:
SCAPE Platform: http://www.scape-project.eu/publication/an-architectural-overview-of-the-scape-preservation-platform
Jpylyzer (JPEG2000 validation): http://www.openplanetsfoundation.org/software/jpylyzer
Matchbox (image comparison): https://github.com/openplanets/scape/tree/master/pc-qa-matchbox
Thank you! Questions?