Archiving and Sharing Big Data Digital Repositories, Libraries, Cloud Storage




Cyrus Shahabi, Ph.D.
Professor of Computer Science & Electrical Engineering
Director, Integrated Media Systems Center (IMSC)
Director, VSoE Informatics
Viterbi School of Engineering, University of Southern California
Los Angeles, CA 90089-0781
shahabi@usc.edu

OUTLINE
- Some Background: Cloud Computing
- An Example: IMSC's TransDec
- A Proposal: USC DataLab


Cloud Computing

Cloud computing is the delivery of computing and storage resources as a service across the Internet to multiple external customers through massive-scale data centers.

Some statistics:
- 51% of all global workloads in 2014 were processed in the cloud versus the traditional IT space [1].
- IBM Big Blue cloud project generated $7 billion in revenue in 2014, up 75% from the previous year [2].
- By 2020, it is estimated that 80% of small businesses in the US will use cloud computing, up from 37% in 2014.

[1] Cisco, http://newsroom.cisco.com/release/1274405
[2] http://talkincloud.com/cloud-computing-funding-and-finance/01202015/ibm-q4-earnings-cloud-revenues-hit-7b-2014

Cloud Computing - Advantages
- Reduced cost: eliminates the investment in stand-alone software or servers
- Scalability and elasticity: provides on-demand resources instantaneously
- Availability: downtime is very small throughout the year
- Quick deployment: minimal effort to integrate applications
- Environmentally friendly: less cooling cost per server, higher utilization

Cloud Computing - Disadvantages
- Security and privacy: by leveraging a remote cloud-based infrastructure, a company essentially gives away private data and information
- Dependency and vendor lock-in: implicit dependency on the provider
- Limited flexibility: since the applications and services run on remote, third-party virtual environments, users have limited control over the hardware and software
- Increased vulnerability: cloud-based solutions are exposed on the public Internet and are thus a more vulnerable target for malicious users and hackers

Cloud Computing - Market shares of big players [figure]

Cloud Computing - Pricing

Virtual Machines (Servers)
- Servers are grouped into categories such as disk-optimized, memory-optimized, CPU-optimized, and GPU.
- Each server group consists of multiple servers.
- Note: "smallest" means the server with the lowest configuration in that group.

Price ($/hour)
Group               Amazon              Microsoft           Google
                    smallest  largest   smallest  largest   smallest  largest
General purpose     0.07      0.56      0.02      0.72      0.077     1.232
Compute optimized   0.105     1.68      2.45      4.9       0.096     0.768
Memory optimized    0.175     2.8       0.33      1.32      0.18      1.44
Disk optimized      0.853     6.82      -         -         -         -
Micro               0.02      0.044     -         -         0.014     0.0385
GPU                 0.65      0.65      -         -         -         -
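To make the hourly prices easier to compare, the following is a minimal sketch that turns them into an approximate monthly cost and picks the cheapest "smallest" server per group. The prices are copied from the pricing table; the 730-hour month and the helper names `monthly_cost` and `cheapest_smallest` are assumptions made for illustration, not part of the original slides.

```python
# Assumption: a month is billed as 365 * 24 / 12 = 730 hours.
HOURS_PER_MONTH = 730

# ($/hour) prices from the pricing table, as (smallest, largest).
# None marks a group for which the slide lists no price.
PRICES = {
    "General purpose": {"Amazon": (0.07, 0.56), "Microsoft": (0.02, 0.72), "Google": (0.077, 1.232)},
    "Compute optimized": {"Amazon": (0.105, 1.68), "Microsoft": (2.45, 4.9), "Google": (0.096, 0.768)},
    "Memory optimized": {"Amazon": (0.175, 2.8), "Microsoft": (0.33, 1.32), "Google": (0.18, 1.44)},
    "Disk optimized": {"Amazon": (0.853, 6.82), "Microsoft": None, "Google": None},
    "Micro": {"Amazon": (0.02, 0.044), "Microsoft": None, "Google": (0.014, 0.0385)},
    "GPU": {"Amazon": (0.65, 0.65), "Microsoft": None, "Google": None},
}


def monthly_cost(hourly_price, hours=HOURS_PER_MONTH):
    """Approximate cost in dollars of running one server for a month."""
    return round(hourly_price * hours, 2)


def cheapest_smallest(group):
    """Provider with the lowest price for the smallest server in a group."""
    offers = {
        provider: price[0]
        for provider, price in PRICES[group].items()
        if price is not None
    }
    return min(offers, key=offers.get)


for group in PRICES:
    provider = cheapest_smallest(group)
    price = PRICES[group][provider][0]
    print(f"{group}: {provider} at ${monthly_cost(price)}/month")
```

The comparison only looks at list price per hour; it ignores differences in the hardware behind each "smallest" server, which the table does not describe.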


Traffic Data Lifecycle: Data Aggregator
An exclusive contract with LA-Metro (2010)

Annotations: Variety (GPS, video, loop sensor, events); Velocity; Volume

Data type (sample XML)  Size    Cycle    Minute    Hourly      Daily          Annual            3 Years
                        (KB)    (s)      (KB)      (KB)        (KB)           (KB)              (KB)
bus_mta_inv2.xml        23      86400    0.96      0.96        23.00          8,395.00          25,185.00
bus_mta_rt2.xml         1065    120      532.50    31,950.00   766,800.00     279,882,000.00    839,646,000.00
cctv_inv.xml            57      86400    0.04      2.38        57.00          20,805.00         62,415.00
cms_inv.xml             52      86400    0.04      2.17        52.00          18,980.00         56,940.00
cms_rt.xml              48      75       38.40     2,304.00    55,296.00      20,183,040.00     60,549,120.00
event_d7.xml            11      75       8.80      528.00      12,672.00      4,625,280.00      13,875,840.00
rail_mta_inv.xml        1       86400    0.00      0.04        1.00           365.00            1,095.00
rail_rt.xml             8       60       8.00      480.00      11,520.00      4,204,800.00      12,614,400.00
rms_inv.xml             865     86400    0.60      36.04       865.00         315,725.00        947,175.00
rms_rt.xml              1236    75       988.80    59,328.00   1,423,872.00   519,713,280.00    1,559,139,840.00
signal_inv.xml          2095    86400    1.45      87.29       2,095.00       764,675.00        2,294,025.00
signal_rt.xml           2636    45       3,514.67  210,880.00  5,061,120.00   1,847,308,800.00  5,541,926,400.00
tt_d7_inv.xml           746     86400    0.52      31.08       746.00         272,290.00        816,870.00
tt_d7_rt.xml            152     60       152.00    9,120.00    218,880.00     79,891,200.00     239,673,600.00
vds_art_d7_inv.xml      115     86400    0.08      4.79        115.00         41,975.00         125,925.00
vds_art_d7_rt.xml       45      60       45.00     2,700.00    64,800.00      23,652,000.00     70,956,000.00
vds_art_ladot_inv.xml   2538    86400    1.76      105.75      2,538.00       926,370.00        2,779,110.00
vds_art_ladot_rt.xml    969     60       969.00    58,140.00   1,395,360.00   509,306,400.00    1,527,919,200.00
vds_fr_d7_inv.xml       957     86400    0.66      39.88       957.00         349,305.00        1,047,915.00
vds_fr_d7_rt.xml        361     30       722.00    43,320.00   1,039,680.00   379,483,200.00    1,138,449,600.00
Total KB from XML data  13980   864660   6,985.28  419,060.38  10,057,449.00  3,670,968,885.00  11,012,906,655.00
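The volume columns in the table follow from two inputs per feed: the size of one XML file and how often it is refreshed. The sketch below reproduces that arithmetic; the function name `volume_kb` and the 365-day year are assumptions chosen to match the table's figures.

```python
SECONDS_PER_DAY = 86_400
MINUTES_PER_DAY = 1_440
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: the table's annual column equals daily * 365


def volume_kb(file_size_kb, cycle_seconds):
    """Data volume in KB produced by one feed over several time spans.

    One file of `file_size_kb` arrives every `cycle_seconds`.
    """
    per_day = file_size_kb * SECONDS_PER_DAY / cycle_seconds
    per_year = per_day * DAYS_PER_YEAR
    return {
        "minute": per_day / MINUTES_PER_DAY,
        "hour": per_day / HOURS_PER_DAY,
        "day": per_day,
        "year": per_year,
        "three_years": per_year * 3,
    }


# rms_rt.xml: 1236 KB every 75 seconds
for span, kb in volume_kb(1236, 75).items():
    print(f"{span}: {kb:,.2f} KB")
```

Applied to the table's annual total of 3,670,968,885 KB, and taking 1 KB as 1000 bytes, the XML feeds alone amount to roughly 3.7 TB per year.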

TransDec: Big data acquisition, storage & access

Pipeline: Input Traffic Data -> Data Processing -> Storage -> Retrieval, Analysis & Visualization

Input traffic data (Sensor 1 to Sensor 4):
- Highway (4313)
- Arterial (4780)
- Bus & Rail (2000)
- Ramp meter
- Events & CMS (800/day)

Data rates shown in the diagram: 46 MB/min, 15 MB/min, 26 TB/year

Data processing and storage:
- Data cleansing
- Spatiotemporal indexing (Oracle Award, IEEE CloudCom Best Paper)

Retrieval, analysis & visualization:
- Real-time queries
- Event location
- E.g., accident impact analysis & prediction (ICDM '12 & '13)
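The slide names spatiotemporal indexing without describing the index itself. As an illustration of the general idea only, the sketch below buckets sensor readings by a fixed latitude/longitude grid cell and a fixed time window, so that a range query touches only the relevant buckets. The class `GridTimeIndex`, its cell and bucket sizes, and the sample readings are all hypothetical and are not TransDec's actual design.

```python
import math
from collections import defaultdict


class GridTimeIndex:
    """Toy spatiotemporal index: grid cells in space, fixed buckets in time."""

    def __init__(self, cell_deg=0.01, bucket_s=300):
        self.cell_deg = cell_deg    # grid cell size in degrees (assumed)
        self.bucket_s = bucket_s    # time bucket length in seconds (assumed)
        self._cells = defaultdict(list)

    def _key(self, lat, lon, ts):
        return (
            math.floor(lat / self.cell_deg),
            math.floor(lon / self.cell_deg),
            math.floor(ts / self.bucket_s),
        )

    def insert(self, lat, lon, ts, reading):
        self._cells[self._key(lat, lon, ts)].append((lat, lon, ts, reading))

    def query(self, lat_min, lat_max, lon_min, lon_max, t_min, t_max):
        """Return all records inside the box and time range (inclusive)."""
        la0, lo0, t0 = self._key(lat_min, lon_min, t_min)
        la1, lo1, t1 = self._key(lat_max, lon_max, t_max)
        hits = []
        for la in range(la0, la1 + 1):
            for lo in range(lo0, lo1 + 1):
                for t in range(t0, t1 + 1):
                    # Buckets only narrow the search; filter exactly here.
                    for rec in self._cells.get((la, lo, t), ()):
                        lat, lon, ts, _ = rec
                        if (lat_min <= lat <= lat_max
                                and lon_min <= lon <= lon_max
                                and t_min <= ts <= t_max):
                            hits.append(rec)
        return hits


# Hypothetical speed readings (mph) near downtown Los Angeles.
index = GridTimeIndex()
index.insert(34.02, -118.29, 1000, 55.0)
index.insert(34.05, -118.25, 1000, 30.0)
index.insert(34.02, -118.29, 5000, 60.0)
nearby = index.query(34.00, 34.03, -118.30, -118.28, 0, 2000)
print(nearby)
```

A fixed grid is the simplest choice; production systems usually prefer adaptive structures such as R-trees or space-filling curves when sensor density varies widely.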


Berkeley Data Analytics Stack - BDAS

BDAS is an open-source software stack that integrates open-source software components to make sense of Big Data.

A high-level overview of BDAS components:
- Data Processing
- Data Management
- Resource Management

Berkeley Data Analytics Stack - BDAS: More in Depth
- Numerous available open-source packages for:
  - Machine learning (MLlib)
  - Graph analysis (GraphX)
  - Real-time analysis (Streaming)
- Big Data applications for various domains
- Flexible intercommunication between layers
- Unlimited expansion
- Many more projects to come

USC DataLab

Create a shared repository of USC data & code for research (on BDAS).

Example: security-related datasets
- CCTV videos from DPS
- Mobile videos from any individuals
- Sensor readings from buildings, from Facility Management
- Crime reports from DPS
- Shuttle bus routes/locations from USC Transportation
- Security patrol car/ambassador locations from DPS
- Events from various sources
- Crowdsourced data from the USC community
- Shared software (for data analysis such as object recognition) from the USC community

Backup Slides

Cloud Computing - Amazon
- Virtual machines: 28 different types of servers
- Big Data analysis for both offline and stream data. Services: Elastic MapReduce, Kinesis, Redshift
- Scalable NoSQL databases. Service: DynamoDB
- Traditional relational databases. Service: RDS
- File and object storage. Service: S3

Cloud Computing - Microsoft
- Virtual machines: 18 different types of servers
- Big Data analysis for both offline and stream data. Services: HDInsight
- Scalable NoSQL databases. Service: Windows Azure Table
- Traditional relational databases. Service: SQL Server
- File and object storage. Service: Windows Azure

Cloud Computing - Google
- Virtual machines: 15 different types of servers
- Big Data analysis for both offline and stream data. Services: BigQuery, Hadoop
- Scalable NoSQL databases. Service: Cloud Datastore
- Traditional relational databases. Service: Google Cloud SQL
- File and object storage. Service: Google Cloud Storage

Berkeley Data Analytics Stack - BDAS: Important Components

Mesos
- A cluster management layer
- Resource management and scheduling across entire datacenters and cloud environments

Spark
- An in-memory, distributed, fault-tolerant processing framework
- Enables data sharing, compared to MapReduce
- In-memory solution, much faster for tasks that bottleneck on disk I/O in MapReduce
- Multiple packages run on top of Spark Core (Spark SQL, Spark MLlib, Spark Streaming)

Tachyon
- Fault-tolerant, memory-centric distributed file system
- Caches working-set files in memory
- Avoids going to disk to load datasets that are frequently read
- Provides memory-level response times for frequently accessed data
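The benefit of keeping a working set in memory can be illustrated with a small, single-process sketch. It is not the Tachyon or Spark API: the class `WorkingSetCache`, its byte budget, and the sample file name are hypothetical, and the least-recently-used eviction policy is an assumption. It only shows the idea of serving repeated reads from memory and going to disk on the first access.

```python
import tempfile
from collections import OrderedDict
from pathlib import Path


class WorkingSetCache:
    """Keep recently read files in memory; evict the least recently used."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0
        self._data = OrderedDict()  # path -> file contents, oldest first

    def read(self, path):
        key = str(Path(path).resolve())
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        blob = Path(key).read_bytes()    # the only disk access
        if len(blob) <= self.capacity_bytes:
            # Make room by dropping the least recently used entries.
            while self.used_bytes + len(blob) > self.capacity_bytes:
                _, evicted = self._data.popitem(last=False)
                self.used_bytes -= len(evicted)
            self._data[key] = blob
            self.used_bytes += len(blob)
        return blob


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "readings.xml"  # hypothetical data file
    sample.write_bytes(b"<rms/>" * 100)
    cache = WorkingSetCache(capacity_bytes=10_000)
    first = cache.read(sample)   # miss: loaded from disk
    second = cache.read(sample)  # hit: served from memory

print(cache.hits, cache.misses)
```

A distributed, fault-tolerant file system has to solve much more than this (replication or lineage, consistency across machines), but the access pattern it optimizes is the same: datasets that are read repeatedly.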

Berkeley Data Analytics Stack - BDAS: Installation Dependencies

BDAS can be installed on any cloud provider, including:
- Amazon Cloud
- Google Cloud
- Microsoft Azure Cloud
- Private Cloud
- HPC

Software requirements
- Runs on both Windows and Unix-like systems (CentOS, RHEL, Mac OS)

Production requirements
- Memory per machine: good behavior documented from 8 GB to hundreds of GB
- CPU cores per machine: provide at least 8-16 cores