BIGS: A Framework for Large-Scale Image Processing and Analysis Over Distributed and Heterogeneous Computing Resources



Similar documents
Image Analytics on Big Data In Motion Implementation of Image Analytics CCL in Apache Kafka and Storm

Hadoop & Spark Using Amazon EMR

2) Xen Hypervisor 3) UEC

Background on Elastic Compute Cloud (EC2) AMI s to choose from including servers hosted on different Linux distros

ur skills.com

HADOOP BIG DATA DEVELOPER TRAINING AGENDA

MyCloudLab: An Interactive Web-based Management System for Cloud Computing Administration

Cloud computing - Architecting in the cloud

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet

Real Time Big Data Processing

Scalable Application. Mikalai Alimenkou

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

Apache Hadoop. Alexandru Costan

AIST Data Symposium. Ed Lenta. Managing Director, ANZ Amazon Web Services

Databricks. A Primer

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

CONDOR CLUSTERS ON EC2

Introduction to Big Data Training

Securing Data in the Virtual Data Center and Cloud: Requirements for Effective Encryption

McAfee Public Cloud Server Security Suite

Scalability of Master-Worker Architecture on Heroku

Cloud3DView: Gamifying Data Center Management

Cloud Computing. Chapter 1 Introducing Cloud Computing

Cloud Computing. Adam Barker

Cloud Computing. Chapter 1 Introducing Cloud Computing

Workshop on Hadoop with Big Data

APPLICATION NOTE. Elastic Scalability. for HetNet Deployment, Management & Optimization

Cloud Models and Platforms

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Putchong Uthayopas, Kasetsart University

Databricks. A Primer

Apache Hama Design Document v0.6

Financial Services Grid Computing on Amazon Web Services January 2013 Ian Meyers

Introduction to AWS in Higher Ed

Hadoopizer : a cloud environment for bioinformatics data analysis

OpenShift on you own cloud. Troy Dawson OpenShift Engineer, Red Hat November 1, 2013

Ironfan Your Foundation for Flexible Big Data Infrastructure

HDFS Cluster Installation Automation for TupleWare

Big Workflow: More than Just Intelligent Workload Management for Big Data

A Service for Data-Intensive Computations on Virtual Clusters

Cheminformatics in the Cloud. Michael A. Dippolito DeltaSoft, Inc. 3-June-2009 ChemAxon European User Group Meeting

CiteSeer x in the Cloud

CloudCenter Full Lifecycle Management. An application-defined approach to deploying and managing applications in any datacenter or cloud environment

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

APP DEVELOPMENT ON THE CLOUD MADE EASY WITH PAAS

Introduction to Cloud Computing

Windows Azure platform What is in it for you? Dominick Baier Christian Weyer

Amazon Web Services Student Tutorial

Cloud Computing Training

Amazon Redshift & Amazon DynamoDB Michael Hanisch, Amazon Web Services Erez Hadas-Sonnenschein, clipkit GmbH Witali Stohler, clipkit GmbH

T Mobile Cloud Computing Private Cloud & Assignment

Automated Application Provisioning for Cloud

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Viswanath Nandigam Sriram Krishnan Chaitan Baru

"Build and Test in the Cloud "

AWS Service Catalog. User Guide

Cloud Computing. Chapter 1 Introducing Cloud Computing

Subash Krishnaswamy Applications Software Technology Corporation

EEDC. Scalability Study of web apps in AWS. Execution Environments for Distributed Computing

Professional Hadoop Solutions

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

This Symposium brought to you by

Hadoop Setup. 1 Cluster

Alfresco Enterprise on AWS: Reference Architecture

ediscovery and Search of Enterprise Data in the Cloud

Hunk & Elas=c MapReduce: Big Data Analy=cs on AWS

Team: May15-17 Advisor: Dr. Mitra. Lighthouse Project Plan Client: Workiva Version 2.1

IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud

Client Overview. Engagement Situation. Key Requirements

The Total Cost of (Non) Ownership of a NoSQL Database Cloud Service

ArcGIS for Server in the Cloud

Federated Cloud-based Big Data Platform in Telecommunications

Building your Big Data Architecture on Amazon Web Services

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Hitachi Data Center Analytics

Open source large scale distributed data management with Google s MapReduce and Bigtable

Foundations of Business Intelligence: Databases and Information Management

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

The OpenNebula Standard-based Open -source Toolkit to Build Cloud Infrastructures

USER CONFERENCE 2011 SAN FRANCISCO APRIL Running MarkLogic in the Cloud DEVELOPER LOUNGE LAB

Plug-and-play Virtual Appliance Clusters Running Hadoop. Dr. Renato Figueiredo ACIS Lab - University of Florida

Distributed Apriori in Hadoop MapReduce Framework

CHAPTER 8 CLOUD COMPUTING

IMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications

Razvoj Java aplikacija u Amazon AWS Cloud: Praktična demonstracija

PROPOSAL To Develop an Enterprise Scale Disease Modeling Web Portal For Ascel Bio Updated March 2015

Chapter 9 PUBLIC CLOUD LABORATORY. Sucha Smanchat, PhD. Faculty of Information Technology. King Mongkut s University of Technology North Bangkok

Cloud Computing. Lecture 24 Cloud Platform Comparison

Introduction to DevOps on AWS

Transcription:

BIGS: A Framework for Large-Scale Image Processing and Analysis Over Distributed and Heterogeneous Computing Resources Raúl Ramos-Pollán, Fabio González, Juan C. Caicedo, Angel Cruz- Roa, Jorge E. Camargo, Jorge A. Vanegas, Santiago A. Pérez, Jose D. Bermeo, Juan S. Otálora, Paola K. Rozo, John E. Arévalo Bioingenium Research Group Universidad Nacional de Colombia 1

1 Motivation Motivation 2

1 Motivation Motivation State of the art computer vision algorithms may be computationally expensive 3

1 Motivation Size of Image Databases 1 icon = 2,000 images Less computing Resources 2,600 25,000 230,000 More computing Resources 4

1 Motivation Feature Extraction Pipelines INPUT IMAGES PATCH EXTRACTION FEATURES EXTRACTION FEATURES CLUSTERING 5

1 Motivation Motivation How to take advantage of seldom availability of computing resources? 6

2 BIGS Framework Strategy Build on experiences with Hadoop / Grid, etc. Profit from whatever resources available Decouple algorithm design from deployment Adopt Big Data principles and technologies Streamline software development process Unify coherent algorithms repository 7

2 BIGS Framework BIGS Framework EXPERIMENTER PIPELINE DEFINITION STAGE 1 STAGE 2 STAGE 4 STAGE 3 INPUT OUTPUT DATA WORKER WORKER SCHEDULE GENERATOR CLIENT at EXPERIMENTER S DESKTOP SCHEDULE NOSQL DATABASE WORKER ADDED LATER WORKERS (AMAZON, DESKTOPs) 8

2 BIGS Framework Pipeline definition example ####################################### # FIRST STAGE: Patch Sampling ####################################### stage.01.task: ROIsFeatureExtractionTask stage.01.numberofpartitions: 10 stage.01.roialgorithm: RandomPatchExtractor stage.01.fealgorithm: GrayHistogram stage.01.randompatchextractor.blocksize: 18 stage.01.randompatchextractor.size: 256 stage.01.input.table: corel5k.imgs stage.01.output.table: corel5k.sample #################################### # THIRD STAGE: Bag of Features Histograms #################################### stage.03.task: BagOfFeaturesExtractionTask stage.03.numberofpartitions: 10 stage.03.maximagesize: 256 stage.03.minimagesize: 20 stage.03.indexcodebook: true stage.03.featureextractor: GrayHistogram stage.03.patchextractor: RegularGridPatchExtractor stage.03.spatiallayoutextractor: IdentityROIExtractor stage.03.codebookid: STAGE-OUTPUT 2 ########################################### # SECOND STAGE: CodeBook Construction ########################################### stage.02.task: KMeans stage.02.kmeans.kvalue: 10 stage.02.kmeans.maxnumberofiterations: 5 stage.02.kmeans.minchangepercentage: 0.1 stage.02.kmeans.numberofpartitions: 10 stage.03.regulargridpatchextractor.blocksize: 18 stage.03.regulargridpatchextractor.stepsize: 9 stage.03.input.table: corel5k.imgs stage.03.output.table: corel5k.histograms stage.02.input.table: corel5k.sample stage.02.output.table: corel5k.codebook SCHEDULE OF JOBS 9

2 BIGS Framework Framework architecture TASKS API TASK PATTERN API WORKER API KMEANS BAG of FEATURES SCHEDULES INPUT/OUTPUT DATASETS UNDERLYING NoSQL STORAGE ITERATIVE DATA PARTITION DYNAMO DB IMPL HBASE IMPL FILE SYSTEM IMPL STORAGE API CORE schedule generator data management jobs wrapper WORKER AMZON VM LAUNCHER JWS LAUNCHER CMD LINE LAUNCHER CMD LINE TOOL WEB CONSOLE CLIENT API EXPERIMENTER 10

SAME PROCESS SAME PROCESS SAME PROCESS PARTITION 1 PARTITION 2 PARTITION N 2 BIGS Framework Algorithms patterns BEFORE PROCESSING ALL PARTITIONS START ITERATION PARTITION 1 PARTITION 2 PARTITION N BEFORE PROCESSING ALL PARTITIONS START PARTITION START PARTITION START PARTITION PROCESS DATA ITEM PROCESS DATA ITEM PROCESS DATA ITEM FINALIZE PARTITION FINALIZE PARTITION FINALIZE PARTITION AFTER PROCESSING ALL PARTITIONS AFTER PROCESSING ALL PARTITIONS FINALIZE ITERATION OUTPUT DATA PARTITION PROCESS DATA ITEM INPUT DATA PARTITION DATA PARTITION ITERATIVE ALGORITHMS ONLINE ALGORITHMS 11

3 Experimentation Experimental Evaluation ImageCLEFmed 2012 Database Split in three subsets: 3K: 1% 30k: 10% 300k: 100% 12

3 Experimentation First Experiment University Computer Room for Students 30K dataset Java Web Start Up to 40 Workers 13

3 Experimentation Results Elapsed and compute times (min) 14

3 Experimentation Second Experiment Amazon EC2 AMAZON CLOUD WORKER WORKER AMAZON LOCAL NETWORK WORKER WORKER 3K dataset S3 Dynamo DB WORKER EC2 VM Up to 20 Workers PIPELINES SCHEDULES DATASETS LOCAL FRAMEWORK WEB BROWSER AMAZON API MANAGE WORKERS (LAUNCH, SHUTDOWN) EXPERIMENTER 15

3 Experimentation Cloud usage # prepare files with AWS credentials and desired pipeline > load.imgs images /home/me/clef/images 306,530 files loaded into table images > pipeline.load bof.pipeline pipeline successfully loaded. pipeline number is 4 > pipeline.prepare 4 pipeline generated 56 jobs > aws.launchvms ami-30495 15 launching 15 instances of ami-30495 > pipeline.info 4.. > aws.termvms ami-30495 5 terminating 5 instances of ami-30495 > pipeline.prepare 4 16

3 Experimentation Results Elapsed and compute times 17

3 Experimentation Results Empirical and theoretical speedup 18

3 Experimentation Third Experiment Servers and desktops at our lab 300K dataset Command line 40 Workers 19

3 Experimentation Results Times required to process the 300k dataset 20

3 Experimentation Results Adhoc image retrieval task 1st out of 53 experiments (textual modality) 3rd out of 36 experiments (visual modality) Label: UNAL http://www.imageclef.org/2012/medical 21

4 Conclusions Conclusions Tool for supporting the full image processing lifecycle First scalability behavior on Amazon Cloud Decoupled algorithm design from deployment Adapted to our reality grasp whatever computer resources available 22

4 Conclusions Conclusions Streamlined internal software process Unified software repository AGILE EXPERIMENTATION LIFECYCLE LITTLE PRIOR KNOWLEDGE ON COMPUTING RESOURCES AVAILABLE 23

4 Conclusions Future work Extend support to additional machine learning processes Larger scalability experiments (# workers) Optimize Amazon usage (DynamoDB) Better understand Amazon costs patterns Experience with NoSQL scalability (HBase) Better understand limitations of each deployment model (opportunistic, cluster, cloud) Move to >1M image datasets 24

Thank You! BIGS: A Framework for Large-Scale Image Processing and Analysis Over Distributed and Heterogeneous Computing Resources Raúl Ramos-Pollán, Fabio González, Juan C. Caicedo, Angel Cruz- Roa, Jorge E. Camargo, Jorge A. Vanegas, Santiago A. Pérez, Jose D. Bermeo, Juan S. Otálora, Paola K. Rozo, John E. Arévalo Bioingenium Research Group Universidad Nacional de Colombia 25