SEAIP 2009 Presentation

1 SEAIP 2009 Presentation
By David Tan
Chair of Yahoo! Hadoop SIG, Singapore
EXCO Member of SGF SIG
Imperial College (UK), Institute of Fluid Science (Japan) & Chicago BOOTH GSB (USA) Alumni

2 Agenda
Problem Domains & Challenges: scalable large-scale data processing
1. CCTV & Video Domains
   What is Hadoop?
   Why the Yahoo! Hadoop platform?
   Hadoop Map-Reduce Methodology
2. Grid Generation: Remeshing with Adaptive Refinement of unstructured meshes for transient CFD problems

3 Problem Domains & Challenges
Challenges:
1. Very large-scale data processing, from gigabytes up to terabytes and petabytes
2. Scalability
3. Cost-effective use of computing resources (Clouds vs HPC)
4. Pay-per-use flexibility
Problem Domains:
1. CCTV Video Analytics
2. Unstructured Grid Generation

4 CCTV & Video Domains
Video files are among the fastest-proliferating kinds of data:
  1,000s of CCTVs added per month
  The UK alone has 4 million public CCTVs operated by the government
  1,000s of YouTube videos added to the internet and viewed per day
  Millions of searches per month
  Growing trend of terabytes to petabytes of data generated per day!
And we crawl the video streams for:
  Real-time live alerts (on-site indexing)
  Forensic analysis (post-processing)

5 Processing Large Scale Data! Tedious and Uneconomical
Just reading 100 terabytes of data can be overwhelming:
  Takes ~11 days to read on a standard computer
  Takes a day across a 10 Gbit link (very high-end storage solution)
  But it only takes 15 minutes on 1000 standard computers!
Using clusters of standard computers, you get:
  Linear scalability
  Commodity pricing
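The read-time figures on this slide can be checked with simple arithmetic. The sketch below is only a back-of-envelope estimate: the per-machine read rate of about 105 MB/s is an assumption, chosen so that a single machine takes roughly the 11 days quoted, and the names are illustrative.

```python
# Back-of-envelope read times for 100 TB, using the slide's figures.
# ASSUMPTION: ~105 MB/s sustained read per standard computer, picked so that
# one machine needs roughly 11 days, as the slide states.

DATA_BYTES = 100e12             # 100 terabytes (decimal)
DISK_BYTES_PER_SEC = 105e6      # assumed per-machine sequential read rate
LINK_BYTES_PER_SEC = 10e9 / 8   # 10 Gbit/s link expressed in bytes per second


def read_seconds(total_bytes: float, bytes_per_sec: float, machines: int = 1) -> float:
    """Time to read total_bytes when the work is split evenly across machines."""
    return total_bytes / (bytes_per_sec * machines)


single_days = read_seconds(DATA_BYTES, DISK_BYTES_PER_SEC) / 86_400
link_hours = read_seconds(DATA_BYTES, LINK_BYTES_PER_SEC) / 3_600
cluster_minutes = read_seconds(DATA_BYTES, DISK_BYTES_PER_SEC, machines=1000) / 60

print(f"one machine:   {single_days:.1f} days")
print(f"10 Gbit link:  {link_hours:.1f} hours")
print(f"1000 machines: {cluster_minutes:.1f} minutes")
```

Under that assumed rate the three results land close to the slide's 11 days, one day, and 15 minutes, and the 1000-machine case is exactly the single-machine time divided by 1000, which is what linear scalability means here.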

6 But there are certain problems
Reliability problems: in large clusters, computers fail every day, and in different ways
  For a single machine, MTBF is ~3 years
  On 1000 machines, MTBF is ~1 day
  Data is corrupted or lost (disk, memory, network)
  Computations are disrupted
Without a good framework, programming clusters is very hard
  Newer programming paradigms? Languages?
  Traditional debugging and performance tools don't apply
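The step from "~3 years" for one machine to "~1 day" for 1000 machines follows from dividing the per-machine MTBF by the machine count. The sketch below assumes independent machines with a constant failure rate (an exponential model), which the slide does not state; the function names are illustrative.

```python
import math

# ASSUMPTION: machines fail independently at a constant rate (exponential model).


def cluster_mtbf_days(machine_mtbf_days: float, machines: int) -> float:
    """Mean time between failures anywhere in the cluster."""
    return machine_mtbf_days / machines


def prob_any_failure(machine_mtbf_days: float, machines: int, window_days: float) -> float:
    """Probability that at least one machine fails within the window."""
    cluster_rate = machines / machine_mtbf_days   # expected failures per day
    return 1.0 - math.exp(-cluster_rate * window_days)


mtbf = cluster_mtbf_days(3 * 365, 1000)           # 3 years per machine, 1000 machines
p_day = prob_any_failure(3 * 365, 1000, window_days=1.0)

print(f"cluster MTBF: {mtbf:.2f} days")
print(f"chance of a failure in any given day: {p_day:.0%}")
```

With these inputs the cluster MTBF is about 1.1 days, which is why a framework that reruns failed work is treated as a requirement rather than an optimisation.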

7 Cloud Computing - a new paradigm?
Demands on quality
Accuracy/integrity
I/O bottlenecks: move computation to storage
Storing and processing Big Data
On-demand scalability
Opportunities and challenges
Speed of getting actionable information (insights)
Significant data volumes (unstructured)
Predictable performance
Reduce operating costs

8 Programming on the Cloud
Good problem: the computing resources of 100s of machines are available to us at low cost. How do we write applications on them?
As an example, consider the problem of creating an index or meta-data tagging for search and quick retrieval (COVIS video demo):
  We have hundreds of hours of video streams
  We want to build an inverted index (using the RGB colour scheme) to perform real-time search and retrieval for forensic analysis
  We can leverage Yahoo! Cloud or other commercial cloud platforms for cost-effective cloud resources

9 COVIS Modules
COVIS-Live: a real-time incident monitoring system for CCTV video surveillance applications.
COVIS-Forensics: high-fidelity processing of recorded videos for investigative analysis; generates spatiotemporal trends.

10 COVIS Live
Perimeter Intrusion
Entrance Detection
Loitering
Virtual Tripwire
Crowd Gathering
People Counting

11 COVIS Live
Wrong Way
Licence Plate Recognition
Illegal Parking
Unattended Object
Remove Object

12 COVIS Forensic
Processes recorded videos for investigative analysis
Generates Meta-Tags on objects found in videos. Meta-Tags include:
  General: Time, Location
  Object Type: Humans (Adult, Child); Cars, Trucks, Motorcycles; Unattended objects
  Object Feature: Color, Shape, Speed, Trajectory, Duration
  Object Behavior: Intrusion, Loitering, Speeding, Abnormal trajectory

13 COVIS Video Retrieval
Video retrieval through a standard Web client
Text-based (natural language) query based on:
  Time
  Location
  Object type
  Object feature
  Object behavior
Examples:
  Human in blue loitering
  Red vehicle speeding
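A query such as "Human in blue loitering" amounts to filtering stored meta-tags on object type, feature, and behavior. The sketch below is a hypothetical illustration only: the `MetaTag` fields, the sample records, and the `search` helper are invented for this example and are not the COVIS schema or API, and the natural-language parsing step is left out.

```python
from dataclasses import dataclass

# HYPOTHETICAL record layout, loosely following the tag categories on the slides.


@dataclass(frozen=True)
class MetaTag:
    video_id: str
    time: str
    location: str
    object_type: str
    color: str
    behavior: str


def search(tags, **criteria):
    """Return the tags whose fields equal every given criterion."""
    return [t for t in tags if all(getattr(t, k) == v for k, v in criteria.items())]


# Made-up sample data for illustration.
TAGS = [
    MetaTag("cam01", "09:15", "lobby", "human", "blue", "loitering"),
    MetaTag("cam02", "09:20", "car park", "vehicle", "red", "speeding"),
    MetaTag("cam01", "09:40", "lobby", "human", "red", "intrusion"),
]

# "Human in blue loitering", already broken down into field values.
hits = search(TAGS, object_type="human", color="blue", behavior="loitering")
for hit in hits:
    print(hit.video_id, hit.time, hit.location)
```

A linear scan like this is fine for a handful of records; at the volumes the talk describes, the same fields would more plausibly be served from a prebuilt inverted index.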

14 System Architecture

15 Why Hadoop?
Open Source by Yahoo!
  Framework for running applications on large clusters built of commodity hardware
  Lets one easily write and run applications that process vast amounts of data (petabytes)
  Open source, top-level Apache project
Motivation & Wishlist
  Want to process lots of data (> 1 TB) on demand (for real-time retrieval) and on a scalable platform
  Want to parallelize across hundreds/thousands of CPUs
  Want to make this easy

16 Hadoop is good for...
Batch data processing, not real-time / user-facing
  Log processing
  Document/media forensic analysis and indexing/meta-data tagging
  Web graphs and crawling
Highly parallel, data-intensive distributed applications
  Bandwidth to data is a constraint
  Number of CPUs is a constraint
Usually, very large production deployments (Grid)
  Several clusters of 1000s of nodes
  LOTS of data (trillions of records, 100 TB+ data sets)

17 Hadoop is
Scalable: store and process petabytes (10^15 bytes) of data on thousands of nodes
  Much larger than RAM, even single disk capacity
  Scale by adding hardware (HW)
Economical: use commodity components when possible
  Lash thousands of these into an effective compute and storage platform
  Commercial platform service providers, e.g. Amazon EC2, Sun & Microsoft, are established players in the Cloud Computing space
Reliable: data is replicated, failed tasks are rerun
  In a large enough cluster something is always broken
  Engineering reliability into every app is expensive: redundancy is built into the system
Efficient: runs tasks where data is located

18 Open Source Apache Project
Hadoop brings MapReduce to everyone
  It's an Open Source Apache project
  Written in Java
  Runs on Linux, Mac OS X, Windows, and Solaris
  Commodity hardware
Hadoop vastly simplifies cluster programming
  Distributed File System - distributes data
  Map/Reduce - distributes application

19 Yahoo! MapReduce - Building an Inverted Index
[Figure: data flow through Input split, Map, intermediate output (sorted), Shuffle, Reduce]
Map output (sorted):
  Machine1: Animals: 1,3; Dog: 3
  Machine2: Animals: 2,12; Bees: 23
  Machine3: Dog: 9; Farmer1: 7
Reduce output after shuffle:
  Machine4: Animals: 1,2,3,12; Bees: 23
  Machine5: Dog: 3,9; Farmer1: 7
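The data flow on this slide can be imitated in a single process to see how the map, shuffle, and reduce stages fit together. This is a minimal sketch in plain Python and does not use the Hadoop API; the input documents and the alphabetical partitioner are assumptions, chosen so that the intermediate and final values match those on the slide.

```python
from collections import defaultdict

# ASSUMED input splits (doc_id -> text), one per map machine, picked so the
# map output matches the slide.
SPLITS = [
    {1: "Animals", 3: "Animals Dog"},             # Machine1
    {2: "Animals", 12: "Animals", 23: "Bees"},    # Machine2
    {9: "Dog", 7: "Farmer1"},                     # Machine3
]


def map_split(split):
    """Map: emit sorted (term, doc_id) pairs for one input split."""
    return sorted((term, doc_id) for doc_id, text in split.items() for term in text.split())


def partition(term):
    """ASSUMED partitioner: terms sorting before 'D' go to reducer 0, the rest to reducer 1."""
    return 0 if term < "D" else 1


def shuffle(mapped, num_reducers=2):
    """Shuffle: send every pair for a given term to the same reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped:
        for term, doc_id in pairs:
            buckets[partition(term)][term].append(doc_id)
    return buckets


def reduce_bucket(bucket):
    """Reduce: merge the postings for each term into one sorted list."""
    return {term: sorted(set(ids)) for term, ids in sorted(bucket.items())}


mapped = [map_split(s) for s in SPLITS]
index_parts = [reduce_bucket(b) for b in shuffle(mapped)]
for machine, part in zip(("Machine4", "Machine5"), index_parts):
    print(machine, part)
```

Each reducer ends up owning a disjoint slice of the vocabulary, so the full inverted index is the union of the reducer outputs.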

20 Yahoo! & Public Commercial Grid Services
Yahoo! operates multiple grid clusters
  10,000s nodes, 100,000s cores, TBs RAM, PBs disk
  Support large internal user community
  Manage data needs (ingest TBs per day)
  Deploy and manage software (Hadoop, Pig, etc.)
Other public clouds:
  Amazon's EC2 Cloud Infrastructure / S3
  Google Cloud
  Microsoft Cloud Azure Platform
  IBM Cloud
  GoGrid, Rackspace, etc.

21 Yahoo! MapReduce Benchmarks
May 11, 2009 Latest News Flash
Benchmarks on Yahoo's Hammer cluster. Hammer's hardware is very similar to the hardware that we used in last year's terabyte sort. The hardware and operating system details are:
  Approx nodes, 2 quad core 2.5 GHz per node
  4 SATA disks per node, 16 GB RAM per node (for petabyte sort)
  1 gigabit ethernet on each node, 40 nodes per rack
  8 gigabit ethernet uplinks from each rack to the core
  Red Hat Enterprise Linux Server Kernel, Sun Java JDK
The best times we observed were:
  Bytes Nodes Maps Reduces Replication Time
  500,000,000, seconds
  1,000,000,000, seconds
  100,000,000,000, ,000 10, minutes
  1,000,000,000,000, ,000 20, minutes

22 For more information: Hadoop References
  Website:
  Wiki (for developers):
  Mailing lists:
  IRC: #hadoop on irc.freenode.org
  Yahoo's Hadoop blog:

23 Advancing Front Grid Generation
Unstructured Mesh

24 Remeshing vs Refinement Logic
Remeshing Procedures
Refinement Procedures

25 Remeshing with Adaptive Refinement
FEM
Advantages
  Remeshing: New mesh has directional characteristics and controllable element spacing.
  Refinement: Based on the classic h-enrichment concept; does not need to go through remeshing. Effective for transient flow problems.
  Remeshing + Adaptive Refinement: Flexibility of multi-level refinement at regions of interest within a desired number of elements. Very effective for transient flows. De-refinement can be done by space warp!
Disadvantages
  Remeshing is CPU-time consuming and not effective for transient flow problems.
  Based on setting of mesh spacing parameters and allowable sizes.
  Increase in number of new elements generated can be substantial.
  Does not exhibit directional characteristics.
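The h-enrichment idea mentioned on this slide, subdividing existing elements where the solution changes sharply instead of generating a new mesh, can be illustrated in one dimension. The sketch below is a toy example and not the presenter's unstructured-mesh algorithm: the jump-based indicator, the tolerance, and the `tanh` test profile are all assumptions made for illustration.

```python
import math


def refine(nodes, f, tol, max_levels=5):
    """Bisect every interval whose jump |f(b) - f(a)| exceeds tol, level by level."""
    for _ in range(max_levels):
        new_nodes = [nodes[0]]
        changed = False
        for a, b in zip(nodes, nodes[1:]):
            if abs(f(b) - f(a)) > tol:          # crude indicator: large jump across the element
                new_nodes.append(0.5 * (a + b))  # split the element at its midpoint
                changed = True
            new_nodes.append(b)
        nodes = new_nodes
        if not changed:                          # nothing left to refine
            break
    return nodes


def front(x):
    """ASSUMED test profile: a steep front centred at x = 0.5."""
    return math.tanh(20.0 * (x - 0.5))


coarse = [0.0, 0.25, 0.5, 0.75, 1.0]
fine = refine(coarse, front, tol=0.2)
spacings = [b - a for a, b in zip(fine, fine[1:])]

print(f"{len(coarse)} nodes -> {len(fine)} nodes")
print(f"largest spacing {max(spacings)}, smallest spacing {min(spacings)}")
```

Nodes cluster around the front while the flat regions keep their original spacing, which is the multi-level, region-of-interest behaviour the slide describes; the cap on levels stands in for keeping the element count within a desired limit.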

26 Refinement vs Remeshing Techniques

27 The End
Questions? Thank You!
