Big Data and the Cloud
IJIS Institute Emerging Technologies 2012
Thomas Herzog, Associate Commissioner, New York State
What is Big Data?
What is Big Data?
Data sets that exceed the boundaries and sizes of normal processing capabilities, forcing the use of non-traditional approaches.
[Chart: Big Data vs. normal processing capabilities, plotted against Velocity (I/O processing) and Volume (file/object size)]
What is Big Data?
Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data effectively.
Big data sizes currently range from a few dozen terabytes to many petabytes in a single data set.
The 3V Factor Model of Big Data
- Volume: amount of data
- Velocity: speed of data in/out
- Variety: range of data types and sources
Volume
Amount of big data stored across the world (in petabytes):
- North America: >3,500
- Europe: >2,000
- Japan: >400
- China: >250
- Middle East: >200
- India: >50
- South America: >50
Variety
- User to User: email, Web logs, virtual communities, social networking
- User to Machine: archives, medical records, digital TV, e-commerce, smart cards, bank cards, computers, mobile
- Machine to Machine: sensors, GPS devices, bar codes, RFID, scanners, surveillance video, scientific research
Velocity
- 2.9 million emails sent every second
- 20 hours of new video uploaded every minute
- 50 million tweets per day
Data: Big and Small
Circa 1975: Transaction Data
- 2,000 users = huge
- Smaller data sets (bytes)
- Highly structured and homogeneous data
- Relatively small tables
- Absolute consistency required
Circa 2010: Cloud Data
- 2,000 users = tiny
- Big data (petabytes)
- Unstructured, complex blobs (images, voice, video, logs) not constrained to tables, columns, and rows
- Application responsiveness and scale trump immediate consistency
Big Data Examples
- Web logs
- RFID
- Sensor networks
- Social networks and social data
- Internet text and documents
- Internet search indexing
- Call detail records
- Genomics, biogeochemical, biological, and other complex and/or interdisciplinary scientific research
- Military surveillance
- Medical records
- Photography and video archives
- Large-scale e-commerce
How Big is BIG?
- Kilobytes (10^3)
- Megabytes (10^6)
- Gigabytes (10^9)
- Terabytes (10^12)
- Petabytes (10^15)
- Exabytes (10^18)
- Zettabytes (10^21)
One zettabyte = 1,000,000,000,000,000,000,000 bytes = 1000^7 bytes = 10^21 bytes.
A zettabyte is equal to 1 billion terabytes, or 1 million petabytes.
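The same relationships can be checked with a short, purely illustrative Python sketch; the helper name and structure are assumptions, not part of the original slide:

```python
# Minimal sketch of the decimal (SI) byte prefixes listed above.
PREFIXES = {
    "kilobyte": 10**3,
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}

def convert(amount, from_unit, to_unit):
    """Convert between SI byte units, e.g. zettabytes to terabytes."""
    return amount * PREFIXES[from_unit] / PREFIXES[to_unit]

print(convert(1, "zettabyte", "terabyte"))  # one billion terabytes
print(convert(1, "zettabyte", "petabyte"))  # one million petabytes
```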
How Big is BIG? All the books in the Library of Congress = less than 10 terabytes
How Big is BIG? All the printed material in the world = less than 20 petabytes
How Big is BIG?
- Wal-Mart handles more than 1 million customer transactions every hour, which feed databases estimated at more than 2.5 petabytes.
- When it began operating in 2000, the Sloan Digital Sky Survey collected more data in its first few weeks than had been gathered in the entire history of astronomy.
How Big is BIG?
- Internet: Google processes about 24 petabytes of data per day.
- Telecoms: AT&T transfers about 19 petabytes of data through its networks each day.
- Physics: The experiments at the Large Hadron Collider produce about 15 petabytes of data per year, distributed over the LHC Computing Grid.
- Neurology: The human brain's capacity to store memories is estimated to be equivalent to about 2.5 petabytes of binary data.
- Archives: The Internet Archive contained about 5.8 petabytes of data as of December 2010 and was growing at roughly 100 terabytes per month.
- Games: World of Warcraft uses 1.3 petabytes of storage to maintain its game.
- Film: The 2009 movie Avatar reportedly required over 1 petabyte of local storage to render its 3D CGI effects.
How Big is BIG?
- Facebook handles 40 billion photos from its user base.
- Decoding the human genome originally took 10 years; now it can be achieved in one week.
- The estimated size of the digital universe in 2011 was 1.8 zettabytes. It is predicted that by 2020 this will grow 44-fold, to roughly 35 zettabytes per year.
Utility vs. Data Clouds
Data Cloud:
- Massively parallel computing
- Highly scalable multi-dimensional databases
- Distributed, highly fault-tolerant massive storage
Utility Cloud:
- Infrastructure as a Service (IaaS)
- Platform as a Service (PaaS)
- Software as a Service (SaaS)
Utility vs. Data Clouds
Utility Clouds:
- Computing services for outsourced IT needs
- Concurrent, independent, multi-tenant users
- Service offerings such as SaaS, PaaS, and IaaS
- Characterized by data segmentation, hosted applications, low cost of ownership, and elasticity
Data Clouds:
- Computing architecture for large-scale processing and analytics
- Designed to operate at trillions of operations per day and petabytes of storage
- Designed for performance, scale, and processing
- Characterized by runtime data models and simplified development models
What is Cloud Computing?
What is Cloud Computing? (Business definition)
A method to address scalability and availability concerns for large-scale applications.
What is Cloud Computing? (Engineering definition)
Providing convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.
What is Cloud Computing? (The big picture)
- America was about the democratization of government.
- The Internet is the democratization of information.
- E-commerce is the democratization of business.
- Blogging is the democratization of news.
- Cloud computing is the democratization of servers: distributed computing for everyone.
Cloud Services
- Software as a Service (SaaS): A way to access applications hosted on the web through your web browser.
- Platform as a Service (PaaS): The delivery of a computing platform and solution stack as a service; a pay-as-you-go model for IT resources accessed over the Internet.
- Infrastructure as a Service (IaaS): Use of computing resources, distributed throughout an internet, to perform parallel processing, distributed storage, and indexing and mining of data.
Cloud Deployment Models (NIST working definitions)
- Internal (private) cloud: The cloud infrastructure is operated within the consumer's organization, or is external but exclusively used by it.
- Community cloud: The cloud infrastructure is jointly owned by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations).
- Public cloud: The cloud infrastructure is owned by an organization selling cloud services to the general public or to a large industry group.
- Hybrid cloud: The cloud infrastructure is a composition of two or more clouds (internal, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability.
Why Move into the Cloud?
- Big for Little: Access to infinite computing resources available on demand, eliminating the need for users to plan far ahead for provisioning. Small agencies get big system resources.
- Pay As You Go: No up-front commitment by cloud users, allowing agencies to start small and add hardware resources only when their needs grow.
- Elastic: The ability to pay for computing resources on a short-term basis as needed (e.g., processors by the hour and storage by the day) and to release them when no longer needed.
Commercial Cloud Formation
Corrections in the Cloud
Examples of where we are, or where we are going:
- Offender phone and messaging solutions
- Video visitation
- Email and email archiving
- Storage and virtual data centers
Issues:
- Agency concerns over data ownership
- CJIS requirements
- Subpoenas for data
- Security and system administrator access to data
Items for Further Discussion
What are the rules governing big data mining?
Use cases:
- Facebook for gang recruitment
- Use of Twitter to develop social media profiles
- Dashboard and/or constellation analysis
- Monitoring/aggregating/engaging technologies
With New Paradigms Come New Challenges
Cloud Computing:
- Security
- Portability
- Interoperability
Big Data:
- New programmatic models
- Blurring of public, external, and internal data
- Effective use of big data
Cloud Standards Customer Council
The goal of the council is to separate the hype from the reality: how to leverage what customers have today, and how to use open, standards-based cloud computing to extend their organizations.
Key considerations outlined by the council:
- Security
- Portability
- Interoperability
Security in the Cloud
A cloud implementation introduces security risks and, at the same time, security advantages. Selecting the right migration path is a key strategy for reducing the risks and maximizing the advantages.
Security in the Cloud: Security Challenges
- Moving personally identifiable and sensitive data to the cloud
- Trusting the vendor's security model
- Data ownership issues and indirect administrator accountability
- Proprietary implementations that cannot be examined
- Large clouds are an attractive target for hackers
- Possibility of massive outages
- Loss of physical control
Security in the Cloud: Security Advantages
- Greater investment in security infrastructure
- Cloud homogeneity makes security auditing and testing simpler
- Clouds enable automated security management
- Simplification of compliance analysis
- Data held by an unbiased party
- Dedicated security team
- Redundancy and disaster recovery
Considerations on Migration
- Balance threat exposure and cost effectiveness: private clouds have less threat exposure, while massive public clouds are more cost effective.
- Leverage the growing body of knowledge available from the Cloud Security Alliance (CSA) and the National Institute of Standards and Technology (NIST).
- Public data can be moved to the cloud today, while higher-sensitivity data is likely to be processed on clouds where organizations control the security model.
Portability in the Cloud
The Open Virtualization Format (OVF) is an industry-standard format for portable virtual machines. Virtual machines packaged in this format can be installed on any virtualization platform that supports the standard. The companies behind the collaboration on this specification include Dell, HP, IBM, Microsoft, VMware, and XenSource.
Interoperability in the Cloud
Standards are foundational elements and enablers of cloud computing interoperability:
- Grid computing and server virtualization
- Web services and service-oriented architecture
- Federated identity management
- Service level agreements
Justice and public safety standards:
- The National Information Exchange Model (NIEM)
- The Global Reference Architecture (GRA)
- Global Federated Identity and Privilege Management (GFIPM)
NIST has a dedicated research group.
Clouds and JPS Standards
- Data interoperability based on NIEM is critical to cloud implementations. Clouds have the potential to further enable NIEM to become the basis upon which successful data sharing across federal, state, local, and tribal government is achieved.
- The Global Reference Architecture (GRA) provides a model for using cloud services to compose complex, customizable, distributed applications.
- The Global Federated Identity and Privilege Management (GFIPM) standard provides a governance mechanism to establish trust across security domains and could be a critical enabler of security in the cloud.
Cloud Statistics
- "If you move your data centre to a cloud provider, it will cost a tenth of the cost." (Brian Gammage, Gartner Fellow)
- Use of cloud applications can reduce costs from 50% to 90%. (CTO of Washington, D.C.)
- An IT resource subscription pilot saw 28% cost savings.
- Preferred Hotels (Alchemy Plus cloud): Traditional: $210k server refresh and $10k/month; Cloud: $10k implementation.
- "Using cloud infrastructures saves 18% to 29%, before considering that you no longer need to buy for peak capacity." (George Reese, founder of Valtira and enStratus)
A Strong Commitment to Cloud Computing
- The current administration has made cloud computing a high priority.
- Considered the next generation of IT in government.
- Supports the objective of creating a more agile federal enterprise, where services can be provisioned and reused on demand to meet business needs.
- "The advantages of cloud computing are so compelling, I don't think there is any going back."
- The justice and public safety world is already developing capabilities to use this paradigm. Companies are offering Software as a Service using Nlets as the network cloud, within which smaller police agencies can have systems without paying the cost of their own servers and localized application software.
Sources: "Federal government takes steps toward cloud computing environment," by Richard W. Walker; "The IJIS Factor: When will cloud computing come of age?," by Paul Wormeli
The Evolution of Government Clouds
- Amazon Web Services GovCloud is designed for sensitive workloads: managed by U.S. personnel and conformant with government-specific controls and certifications.
- Microsoft has announced a number of diverse offerings, ranging from its Azure Appliance to a dedicated government cloud offering based on the Business Productivity Online Suite (BPOS).
- Google has announced completion of FISMA certification for a multi-tenant cloud application, and Google Apps has received an authority to operate at the FISMA Moderate level.
Source: "Government Clouds," by Tom Kooy, September 12, 2011
Cloud Case Studies: Selective Service System
- The goal was to operate at maximum potential with a reduced annual budget and, at the same time, achieve agility at a reduced total cost of ownership.
- Employed a phased approach, migrating one system at a time, with ongoing improvements to the cloud environment.
- Achieved the following key benefits: improved database analysis performance, ease of implementation, and rapid system deployment.
Cloud Case Studies: City of Miami Cuts Costs with Cloud Services
- The goal was to develop an online application to record, track, and report on non-emergency 3-1-1 incidents to better serve citizens.
- The city was facing constraints such as a tighter budget and fewer personnel.
- The benefits were reduced cost, fast time to market, greater ability to offer new services to citizens, and improved disaster recovery.
"With Windows Azure, we don't have to worry about managing a costly infrastructure and can focus on delivering new services that positively impact citizens and our organization." (James Osteen, Assistant Director of Information Technology, City of Miami)
Cloud Case Studies: City of Washington, D.C.
- Migrating 38,000 employees to cloud applications.
- Replaced current software with: Gmail, Google Docs (word processing and spreadsheets), Google Video for Business, and Google Sites (intranet sites and wikis).
- 500,000+ organizations use these apps.
"It's a fundamental change to the way our government operates by moving to the cloud. Rather than owning the infrastructure, we can save millions." (Vivek Kundra, former Federal CIO)
Cloud Case Studies
President Obama's Citizen's Briefing Book (based on a cloud application):
- Concept to live in three weeks
- 134,077 registered users, 1.4 million votes, 52,015 ideas
- Peak traffic of 149 hits per second
US Census Bureau (new cloud application):
- Project implemented in under 12 weeks
- 2,500+ partnership agents used the cloud application for the 2010 decennial census
- Allows projects to scale from 200 to 2,000 users overnight to meet peak periods, with no capital expenditure
Town of East Hampton, New York (RMS/CAD):
- The goal was to leverage existing investments and, at the same time, adopt an operating-expense-centric strategy that provided more flexibility. A key requirement was the ability to react rapidly to increased service requirements.
- The solution was an enterprise-level CAD/RMS system that dynamically interacts with a cloud platform and consists of large-scale infrastructure, replication, load balancing, resource allocation, and more.
- Key benefits are reduced hardware, software, and maintenance costs, new functionality, and improved productivity.
Emerging Technologies for Big Data
- Massively parallel processing (MPP) databases
- Data mining grids
- Distributed file systems
- Distributed databases
- Cloud computing platforms
- Scalable storage systems
Big Data Analytics Scalability (the problem)
- What are all of the relevant data sources?
- What are the characteristics of the different data sources?
- Where do these data sources reside?
- What is the capacity of the host platforms?
- What is the reliability of the host platforms?
- How is interoperability between host platforms defined?
Big Data Analytics Scalability (the emerging solution)
If it is partitionable, it is "Hadoop-able."
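As a purely illustrative sketch of why partitionable data is "Hadoop-able," the map/reduce pattern that Hadoop applies can be mimicked in a few lines of plain Python. This is not Hadoop's API, only the shape of the computation; the function names and toy data are invented:

```python
# Minimal word-count sketch of the map/reduce pattern applied to partitioned
# data. Plain Python for illustration only; not the Hadoop API.
from collections import defaultdict

def map_phase(partition):
    """Emit (word, 1) pairs for one partition of the input."""
    for line in partition:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Sum the counts emitted by all mappers."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

partitions = [["big data and the cloud"], ["the data cloud", "the end"]]
mapped = [pair for p in partitions for pair in map_phase(p)]
print(reduce_phase(mapped))  # e.g. {'the': 3, 'data': 2, 'cloud': 2, ...}
```

Because each partition is mapped independently, the work can be spread across as many nodes as there are partitions.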
MPP for Big Data Defined
Massively Parallel Processing (analytic algorithms): Parallel computing is a well-adopted technology, seen in processor cores and software thread-based parallelism. However, MPP that leverages thousands of networked commodity servers, constrained only by bandwidth, is now the emerging context for the Data Cloud.
MPP for Big Data Defined
- MPP databases have the ability to store and manage petabytes of data.
- Big data analytics processes generally avoid shared storage. They prefer direct-attached storage (DAS) in its various forms, from solid-state disk (SSD) to high-capacity serial advanced technology attachment (SATA) disk, buried inside parallel processing nodes.
- The perception of shared storage architectures (SAN and NAS) is that they are relatively slow, complex, and, above all, expensive, whereas big data analytics systems thrive on performance, commodity infrastructure, and low cost. Cloud computing is a natural fit.
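A minimal sketch of the parallel-processing idea behind MPP, using local Python worker processes to stand in for networked commodity nodes; the analytic function and the toy partitions are placeholders, not something from the presentation:

```python
# Fan a placeholder analytic out across local worker processes, standing in
# for the commodity nodes of an MPP cluster. Illustrative sketch only.
from multiprocessing import Pool

def analyze(partition):
    """Placeholder analytic: count the records in one data partition."""
    return len(partition)

if __name__ == "__main__":
    partitions = [list(range(1000)) for _ in range(8)]  # toy data partitions
    with Pool(processes=4) as pool:                     # parallel worker nodes
        counts = pool.map(analyze, partitions)
    print(sum(counts))  # 8,000 records processed in parallel
```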
Distributed, Highly Fault-Tolerant Massive Storage
The Google File System (GFS) and the Apache Hadoop Distributed File System (HDFS) are two examples of proven approaches to creating distributed, highly fault-tolerant massive storage systems. Such a system:
- Is reliable, allowing distributed storage and replication of bytes across networks and hardware assumed to fail at any time
- Allows for massive, world-scale storage that separates metadata from data
- Supports a write-once, sporadic-append, read-many usage pattern
- Stores very large files, often each greater than 1 terabyte in size
- Allows compute cycles to be easily moved to the data store, instead of moving data to a processor farm
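As a rough sketch of the write-once, read-many pattern, the standard `hadoop fs` command line can be driven from Python. This assumes a configured Hadoop client and a reachable HDFS cluster; the file paths are hypothetical:

```python
# Illustrative sketch only: drive the standard `hadoop fs` CLI from Python.
# Assumes a configured Hadoop client and reachable HDFS; paths are made up.
import subprocess

def hdfs_put(local_path, hdfs_path):
    """Write a file into HDFS once; the file system replicates its blocks."""
    subprocess.run(["hadoop", "fs", "-put", local_path, hdfs_path], check=True)

def hdfs_cat(hdfs_path):
    """Read the stored file back as many times as needed."""
    result = subprocess.run(["hadoop", "fs", "-cat", hdfs_path],
                            check=True, capture_output=True)
    return result.stdout

hdfs_put("local_logs.txt", "/data/logs/2012-05-21.txt")
print(hdfs_cat("/data/logs/2012-05-21.txt")[:80])
```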
Programmatic Models for Scaling in the Data Cloud
Building applications and architectures that run in the Data Cloud requires new thinking about scale, elasticity, and resilience. Cloud application architectures follow two key tenets:
- Elasticity: only use computing resources when needed
- Scalability: survive drastically changing data volumes
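A minimal sketch of the two tenets, under stated assumptions (the workload, worker cap, and function names are invented for illustration): workers are only created when there is work to do, and the pool grows with the number of records:

```python
# Illustrative sketch of elasticity and scalability: size the worker pool to
# the current workload and release the resources when the batch is done.
from concurrent.futures import ThreadPoolExecutor

def handle(record):
    return record.upper()  # placeholder per-record work

def process_batch(records, worker_cap=32):
    if not records:                          # elasticity: no work, no resources
        return []
    workers = min(worker_cap, len(records))  # scalability: grow with volume
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle, records))

print(process_batch(["robbery", "burglary", "theft"]))
```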
The Blurring of Public, External and Internal Big Data (the problem)
The Blurring of Public, External and Internal Big Data (the emerging solution)
Security and Big Data (the problem)
Just as this data is potentially valuable to you, it is also valuable to an attacker.
- Some 1,271 government organizations and 1,931 private companies work on programs related to counterterrorism, homeland security, and intelligence in about 10,000 locations across the United States.
- Intelligence analysts alone publish 50,000 reports each year.
- Approximately 854,000 people hold top-secret security clearances (nearly 1.5 times as many people as live in Washington, D.C.).
Sources: "Big data: Information security downsides (and upsides too!)"; Washington Post
Security and Big Data (the emerging solution)
Source: Forrester Research, Inc.
Security and Big Data (the emerging solution)
- Classification: Data classification is critical to protecting the data.
- Consistent security controls: Centralizing the data allows consistent security controls to be defined.
- Auditing: Logging, and analyzing the logs, is the key to finding security breaches and data misuse trends (see the sketch below).
- Toxic data: Destroying unneeded information is key to reducing risk.
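To make the auditing point concrete, here is a small, purely hypothetical sketch: every data access is recorded as a (user, record) pair, and the log is scanned for users whose access volume looks anomalous. The threshold, user names, and record names are all invented:

```python
# Hypothetical auditing sketch: flag users whose access counts stand out.
from collections import Counter

access_log = [
    ("analyst_a", "case_123"), ("analyst_a", "case_124"),
    ("analyst_b", "case_123"),
] + [("analyst_c", f"case_{i}") for i in range(500)]  # suspicious bulk access

def flag_heavy_users(log, threshold=100):
    """Return users who accessed more records than the chosen threshold."""
    counts = Counter(user for user, _ in log)
    return [user for user, count in counts.items() if count > threshold]

print(flag_heavy_users(access_log))  # ['analyst_c']
```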
Privacy Concerns (the problem)
"Much of this will look like Big Brother vacuuming up every scrap of people's behavior and knowledge and using it in ways that were never intended."*
- How anonymous is it really?
- How do we measure the risk that social media introduces to the privacy of the individual?
- When do conclusions drawn from big data analysis cross the boundary into intelligence?
*Source: "How social media and big data will unleash what we know," by Dion Hinchcliffe
Privacy Concerns (the solution)
- Strategies to keep data anonymous while retaining the ability to produce meaningful analytical results (a minimal sketch follows this list)
- Privacy policies specifically developed to target the problems introduced by big data
- Privacy policy automation to meet the requirement for large-scale, distributed application of policies
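One such strategy can be sketched in a few lines: replace direct identifiers with a salted, one-way hash so records remain linkable for analysis without exposing identities. This is an illustrative assumption about the approach, not a technique named in the presentation, and the field names are invented:

```python
# Illustrative pseudonymization sketch: salted, one-way hashing of identifiers.
import hashlib

SALT = b"rotate-and-protect-this-value"  # in practice, managed as a secret

def pseudonymize(identifier):
    """Return a stable, non-reversible token for a direct identifier."""
    return hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()[:16]

record = {"name": "Jane Doe", "case_number": "123-45-6789", "incident": "theft"}
safe_record = {
    "subject_id": pseudonymize(record["case_number"]),  # linkable, not readable
    "incident": record["incident"],
}
print(safe_record)
```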
Effective Use of Big Data (the problem)
Big data is all about volume, variety, and velocity. How do we even know what question it can help answer?
Effective Use of Big Data (the solution)
It is all about mission-enabled technology solutions! What about pattern recognition and predictive analytics providing the answers before we even knew we had the question?
The Intersection of Social Media and Big Data
"As the world continues to become more and more social, competitive advantage will come to those who understand what's happening better than their peers and can directly connect it to their business outcomes and other useful pursuits."*
*Source: "How social media and big data will unleash what we know," by Dion Hinchcliffe