Archiving and Sharing Big Data
Digital Repositories, Libraries, Cloud Storage
Cyrus Shahabi, Ph.D.
Professor of Computer Science & Electrical Engineering
Director, Integrated Media Systems Center (IMSC)
Director, VSoE Informatics
Viterbi School of Engineering, University of Southern California
Los Angeles, CA 90089-0781
shahabi@usc.edu
OUTLINE
- Some Background: Cloud Computing
- An Example: IMSC's TransDec
- A Proposal: USC DataLab
Cloud Computing
Cloud computing is the delivery of computing and storage resources as a service across the Internet to multiple external customers through massive-scale data centers.
Some statistics:
- 51% of all global workloads in 2014 were processed in the cloud versus the traditional IT space [1].
- IBM's "Big Blue" cloud project generated $7 billion in revenue in 2014, up 75% from the previous year [2].
- By 2020, it is estimated that 80% of small businesses in the US will use cloud computing, up from 37% in 2014.
[1] Cisco, http://newsroom.cisco.com/release/1274405
[2] http://talkincloud.com/cloud-computing-funding-and-finance/01202015/ibm-q4-earnings-cloud-revenues-hit-7b-2014
Cloud Computing: Advantages
- Reduced cost: eliminates the investment in stand-alone software or servers
- Scalability and elasticity: provides on-demand resources instantaneously
- Availability: downtime is very small throughout the year
- Quick deployment: minimal effort in integrating applications
- Environmentally friendly: less cooling cost per server, more utilization
Cloud Computing: Disadvantages
- Security and privacy: by leveraging a remote cloud-based infrastructure, a company essentially gives away private data and information
- Dependency and vendor lock-in: implicit dependency on the provider
- Limited flexibility: since the applications and services run on remote, third-party virtual environments, users have limited control over the hardware and software
- Increased vulnerability: cloud-based solutions are exposed on the public Internet and are thus a more vulnerable target for malicious users and hackers
Cloud Computing: Market Shares of Big Players (figure)
Cloud Computing: Pricing
Virtual Machines (Servers)
- Servers are grouped into categories such as disk-optimized, memory-optimized, CPU-optimized, and GPU.
- Each server group consists of multiple servers.
- Note: "smallest" means the server with the lowest configuration in that group.

Price ($/hour)
Group               Amazon               Microsoft            Google
                    smallest   largest   smallest   largest   smallest   largest
General purpose     0.07       0.56      0.02       0.72      0.077      1.232
Compute optimized   0.105      1.68      2.45       4.9       0.096      0.768
Memory optimized    0.175      2.8       0.33       1.32      0.18       1.44
Disk optimized      0.853      6.82      -          -         -          -
Micro               0.02       0.044     -          -         0.014      0.0385
GPU                 0.65       0.65      -          -         -          -
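The hourly prices above can be turned into a rough monthly figure for comparison. The following is a minimal sketch, assuming a server that runs continuously for 730 hours per month (an assumption, not a figure from the slides); the dictionary and function names are illustrative only.

```python
# Hourly prices ($/hour) of the smallest general-purpose server,
# copied from the pricing table on this slide.
SMALLEST_GENERAL_PURPOSE = {"Amazon": 0.07, "Microsoft": 0.02, "Google": 0.077}

# Assumption (not from the slides): an always-on server, 365 * 24 / 12 hours.
HOURS_PER_MONTH = 730


def monthly_cost(price_per_hour, hours=HOURS_PER_MONTH):
    """Return the cost in dollars of running one server for `hours` hours."""
    return price_per_hour * hours


def cheapest(prices):
    """Return the provider name with the lowest hourly price."""
    return min(prices, key=prices.get)


if __name__ == "__main__":
    for provider, price in SMALLEST_GENERAL_PURPOSE.items():
        print(f"{provider}: ${monthly_cost(price):.2f}/month")
    print("Cheapest:", cheapest(SMALLEST_GENERAL_PURPOSE))
```

This compares list prices only; it ignores storage, data transfer, and any discounts, which the slide does not cover.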
Traffic Data Lifecycle: Data Aggregator
An exclusive contract with LA-Metro (2010)
Big-data dimensions annotated on the slide: Variety (GPS, video, loop sensor, events), Velocity, Volume.
File size and all volumes are in KB; cycle duration is in seconds.

Sample XML file           Size   Cycle    Minute      Hourly          Daily            Annual            3 Years
bus_mta_inv2.xml            23   86400      0.96        0.96          23.00          8,395.00          25,185.00
bus_mta_rt2.xml           1065     120    532.50   31,950.00     766,800.00    279,882,000.00     839,646,000.00
cctv_inv.xml                57   86400      0.04        2.38          57.00         20,805.00          62,415.00
cms_inv.xml                 52   86400      0.04        2.17          52.00         18,980.00          56,940.00
cms_rt.xml                  48      75     38.40    2,304.00      55,296.00     20,183,040.00      60,549,120.00
event_d7.xml                11      75      8.80      528.00      12,672.00      4,625,280.00      13,875,840.00
rail_mta_inv.xml             1   86400      0.00        0.04           1.00            365.00           1,095.00
rail_rt.xml                  8      60      8.00      480.00      11,520.00      4,204,800.00      12,614,400.00
rms_inv.xml                865   86400      0.60       36.04         865.00        315,725.00         947,175.00
rms_rt.xml                1236      75    988.80   59,328.00   1,423,872.00    519,713,280.00   1,559,139,840.00
signal_inv.xml            2095   86400      1.45       87.29       2,095.00        764,675.00       2,294,025.00
signal_rt.xml             2636      45  3,514.67  210,880.00   5,061,120.00  1,847,308,800.00   5,541,926,400.00
tt_d7_inv.xml              746   86400      0.52       31.08         746.00        272,290.00         816,870.00
tt_d7_rt.xml               152      60    152.00    9,120.00     218,880.00     79,891,200.00     239,673,600.00
vds_art_d7_inv.xml         115   86400      0.08        4.79         115.00         41,975.00         125,925.00
vds_art_d7_rt.xml           45      60     45.00    2,700.00      64,800.00     23,652,000.00      70,956,000.00
vds_art_ladot_inv.xml     2538   86400      1.76      105.75       2,538.00        926,370.00       2,779,110.00
vds_art_ladot_rt.xml       969      60    969.00   58,140.00   1,395,360.00    509,306,400.00   1,527,919,200.00
vds_fr_d7_inv.xml          957   86400      0.66       39.88         957.00        349,305.00       1,047,915.00
vds_fr_d7_rt.xml           361      30    722.00   43,320.00   1,039,680.00    379,483,200.00   1,138,449,600.00
Total KB from XML data   13980  864660  6,985.28  419,060.38  10,057,449.00  3,670,968,885.00  11,012,906,655.00
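The volume columns in this table follow from each feed's file size and cycle duration. A minimal sketch of that arithmetic, assuming one file of the stated size arrives every cycle and a 365-day year (which matches the annual figures shown); the function name and dictionary keys are illustrative, not part of TransDec.

```python
def feed_volume_kb(file_size_kb, cycle_seconds):
    """Return the data volume in KB that one feed produces per time window.

    Assumes one file of `file_size_kb` arrives every `cycle_seconds` seconds.
    """
    per_second = file_size_kb / cycle_seconds
    per_day = file_size_kb * 86400 / cycle_seconds
    return {
        "minute": per_second * 60,
        "hourly": file_size_kb * 3600 / cycle_seconds,
        "daily": per_day,
        "annual": per_day * 365,        # 365-day year, as in the table
        "three_years": per_day * 365 * 3,
    }


if __name__ == "__main__":
    # signal_rt.xml: 2636 KB every 45 seconds (values from the table).
    for window, kb in feed_volume_kb(2636, 45).items():
        print(f"{window:>12}: {kb:,.2f} KB")
```

Applied to signal_rt.xml, the fastest-growing feed, the sketch gives the same minute, hourly, daily, annual, and three-year values as the corresponding table row.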
TransDec: Big Data Acquisition, Storage & Access
Pipeline (diagram): Input Traffic Data -> Data Processing -> Storage -> Retrieval, Analysis & Visualization
Input sensors:
- Highway (4313)
- Arterial (4780)
- Bus & Rail (2000)
- Ramp meter
- Events & CMS (800/day)
Rates shown in the diagram: 46 MB/min, 15 MB/min, 26 TB/year
Processing and access:
- Data cleansing
- Real-time queries & spatiotemporal indexing (Oracle Award, IEEE CloudCom Best Paper)
- Event location
- E.g., accident impact analysis & prediction (ICDM '12 & '13)
Berkeley Data Analytics Stack (BDAS)
BDAS is an open-source software stack that integrates open-source software components to make sense of Big Data.
A high-level overview of BDAS components:
- Data Processing
- Data Management
- Resource Management
Berkeley Data Analytics Stack (BDAS): More in Depth
- Numerous available open-source packages for:
  - Machine Learning (MLlib)
  - Graph analysis (GraphX)
  - Real-time analysis (Streaming)
- Big Data applications for various domains
- Flexible intercommunication between layers
- Unlimited expansion
- Many more projects to come
USC DataLab
Create a shared repository of USC data & code for research (on BDAS).
Example: security-related datasets
- CCTV videos from DPS
- Mobile videos from any individuals
- Sensor readings from buildings, from Facility Management
- Crime reports from DPS
- Shuttle bus routes/locations from USC Transportation
- Security patrol car/ambassador locations from DPS
- Events from various sources
- Crowdsourced data from the USC community
- Shared software (for data analysis such as object recognition) from the USC community
Backup Slides
Cloud Computing: Amazon
- Virtual machines: 28 different types of servers
- Big Data analysis for both offline and stream data. Services: Elastic MapReduce, Kinesis, Redshift
- Scalable NoSQL databases. Service: DynamoDB
- Traditional relational databases. Service: RDS
- File and object storage. Service: S3
Cloud Computing: Microsoft
- Virtual machines: 18 different types of servers
- Big Data analysis for both offline and stream data. Services: HDInsight
- Scalable NoSQL databases. Service: Windows Azure Table
- Traditional relational databases. Service: SQL Server
- File and object storage. Service: Windows Azure
Cloud Computing: Google
- Virtual machines: 15 different types of servers
- Big Data analysis for both offline and stream data. Services: BigQuery, Hadoop
- Scalable NoSQL databases. Service: Cloud Datastore
- Traditional relational databases. Service: Google Cloud SQL
- File and object storage. Service: Google Cloud Storage
Berkeley Data Analytics Stack (BDAS): Important Components
Mesos
- A cluster management layer
- Resource management and scheduling across entire datacenters and cloud environments
Spark
- An in-memory, distributed, fault-tolerant processing framework
- Enables data sharing between computations, in contrast to MapReduce
- In-memory solution, much faster for tasks that bottleneck on disk I/O in MapReduce
- Multiple packages run on top of Spark Core (Spark SQL, Spark MLlib, Spark Streaming)
Tachyon
- Fault-tolerant, memory-centric distributed file system
- Caches working-set files in memory
- Avoids going to disk to load datasets that are frequently read
- Provides memory-level response times for frequently accessed data
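The idea behind caching a working set in memory, as described for Tachyon above, can be illustrated with a toy read-through cache. This is a single-process sketch in plain Python, not Tachyon's API or implementation; the class name, the `load_from_disk` callback, and the least-recently-used eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class WorkingSetCache:
    """Toy read-through cache: keeps the most recently used files in memory."""

    def __init__(self, load_from_disk, capacity=2):
        self._load = load_from_disk    # slow loader, called only on a miss
        self._capacity = capacity      # maximum number of cached files
        self._files = OrderedDict()    # path -> contents, oldest first
        self.disk_reads = 0            # counts how often we went to "disk"

    def read(self, path):
        if path in self._files:
            self._files.move_to_end(path)      # mark as most recently used
            return self._files[path]
        self.disk_reads += 1                   # cache miss: load from disk
        data = self._load(path)
        self._files[path] = data
        if len(self._files) > self._capacity:
            self._files.popitem(last=False)    # evict least recently used
        return data


if __name__ == "__main__":
    fake_disk = {"a.csv": b"aaa", "b.csv": b"bbb"}   # stand-in for real storage
    cache = WorkingSetCache(fake_disk.__getitem__)
    for _ in range(3):
        cache.read("a.csv")
    print("disk reads:", cache.disk_reads)
```

Repeated reads of the same file touch the slow loader only once, which is the effect the slide attributes to memory-centric storage for frequently read datasets.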
Berkeley Data Analytics Stack (BDAS): Installation Dependencies
BDAS can be installed on any cloud provider:
- Amazon Cloud
- Google Cloud
- Microsoft Azure Cloud
- Private Cloud
- HPC
Software requirements
- Runs on both Windows and Unix-like systems (CentOS, RHEL, Mac OS)
Production requirements
- Memory per machine: good behavior documented from 8 GB to hundreds of GB
- CPU cores per machine: provide at least 8-16 cores