NoSQL for SQL Professionals William McKnight Session Code BD03
About Your Speaker: William McKnight
- President, McKnight Consulting Group
- Frequent keynote speaker and trainer internationally
- Consulted to Pfizer, Scotiabank, Teva Pharmaceuticals, Verizon, and many other Global 1000 companies
- A prolific writer with hundreds of articles, blogs and white papers in publication
- Focused on delivering business value and solving business problems utilizing proven, streamlined approaches to information management
- Former Fortune 50 Information Technology executive
No More
The No-Reference Architecture DATA STREAM PROCESSING GRAPH DATA HADOOP DATA WAREHOUSE DATA MARTS USERS/REPORTS MDBS RDBMS LEGACY SOURCES DATA WAREHOUSE APPLIANCE NOSQL COLUMNAR DATABASES ELEMENTS IN THE CLOUD IN-MEMORY DATABASES MASTER DATA DATA INTEGRATION SYNDICATED DATA
The Relational Database Data Page
Page Header
Records:
  1120 Aris Doug Johnson Practice Director 206-676-5636 doug.johnson@aris.com
  1121 Stolt Offshore MS Craig Lennox Mr +66 1226 71269 craig.lennox@stoltoffshore.com
  1122 Medtronic, Inc. Mark Kohls Principle Database Administrator 763.516.2557 mark.kohls@medtronic.com
Row IDs
Page Footer
McKnight Consulting Group, 2010
What Does Big Data Mean?
- Data in NoSQL: "No SQL allowed" or "Not Only SQL"?
- Sensor, social and web data?
- Data in a system that does not support SQL?
- A system with petabytes?
- Hadoop?
Why the Sudden Explosion of Interest?
- An increased number and variety of data sources that generate large quantities of data
  - Sensors (e.g., location, RFID)
  - Social (e.g., Twitter, wikis)
  - Web clicks
- Realization that data was too valuable to delete
  - Even when there is little signal amid lots of noise
- Dramatic decline in the cost of hardware, especially storage
  - If storage were still $100/GB, there would be no big data revolution underway
Why NoSQL for Big Data
- More data model flexibility
  - JSON as a data model (think XML)
  - No schema-first requirement; load first
- Faster time to insight from data acquisition
- Relaxed ACID
  - Eventual consistency
  - Willing to trade consistency for availability
  - ACID would crush things like storing clicks on Google
- Low upfront software costs
- Utilizes Java
- Full scans
- Programmers love the freedoms
Hadoop, MapReduce and Big Data
- Parallel programming framework
- Hadoop is an open source distributed file system (HDFS) plus MapReduce
- Hadoop is used by those facing web-scale data challenges
Scale Up vs. Scale Out
- Scale up: single, fixed-resource controller; growth through adding shelves
- Scale out: multiple controllers; processing power in each unit of disk
Who Uses Hadoop
- 40,000+ nodes running Hadoop
- Research for ad systems and web search
- Product search indexes
- Analytics from user sessions
- Log analysis for reporting and analytics and machine learning
- Log analysis, data mining, and machine learning
- Large scale image conversion
- High energy physics, genomics, Digital Sky Survey
ACID
- Atomicity: full transactions pass or fail
- Consistency: database in valid state after each transaction
- Isolation: transactions do not interfere with one another
- Durability: transactions remain committed no matter what (e.g., crashes)
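To make the atomicity property concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the names `alice` and `bob`, and the helper functions are assumptions invented for illustration; they do not come from the slides. The point is only that a failed transfer leaves both balances exactly as they were.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                # The debit above is undone by the rollback.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
```

Calling `transfer(conn, "alice", "bob", 500)` on the sample data returns `False` and leaves the balances untouched, which is the "pass or fail" behavior the slide describes. Many NoSQL stores relax exactly this kind of multi-record guarantee.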
What Gives the CIO Heartburn About NoSQL
- Developer Skills
- Lack of ACID Compliance
- Tools lacking and Projects Flawed
- Fast Nature of Unburdened Projects
- Different Developers
- Schema-less/lite Models
- Lack of Payback Methodology
DFS Block Placement
Example:
- Write affinity
- Placement to minimize cross-rack network traffic
- Placement to tolerate switch failures
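The placement goals on this slide can be sketched as a small function. This is a simplified, deterministic illustration loosely modeled on the commonly described HDFS default policy (first replica on the writer's node, second on a node in another rack, third on a different node in that same remote rack). The function name, the `topology` dictionary and the node names are assumptions for illustration, not Hadoop APIs, and real placement also weighs load and free space.

```python
def place_replicas(writer_node, topology):
    """Pick three nodes for one block.

    topology maps a rack name to the list of node names in that rack
    (a hypothetical structure used only for this sketch).
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    if writer_node not in rack_of:
        raise ValueError(f"unknown node: {writer_node}")
    local_rack = rack_of[writer_node]

    # A remote rack needs two nodes so replicas 2 and 3 can share it.
    remote_racks = sorted(
        rack
        for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    )
    if not remote_racks:
        raise ValueError("need a second rack with at least two nodes")

    # Replica 1: the writer's own node (write affinity).
    # Replica 2: a different rack, so one switch failure cannot lose the block.
    # Replica 3: same remote rack as replica 2, so only one hop crosses racks.
    second, third = sorted(topology[remote_racks[0]])[:2]
    return [writer_node, second, third]
```

With `{"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2", "b3"]}` and writer `a1`, the sketch picks `a1`, `b1` and `b2`: two racks hold the block, and the write pipeline crosses racks only once.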
File System Summary
- Highly scalable: 1000s of nodes and massive (100s of TB) files
- Large block sizes to maximize sequential I/O performance
- No use of mirroring or RAID, to reduce cost
- Use one mechanism (triply replicated blocks) to deal with a wide variety of failure types rather than multiple different mechanisms
Negatives
- Lack of control over record placement
- Makes it impossible to employ many optimizations successfully employed by parallel DB systems
MapReduce
1. Take a large problem and divide it into sub-problems
2. Perform the same function on all sub-problems
3. Combine the output
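The three steps above can be sketched in plain Python on a single machine. The sum-of-squares workload and the function names are assumptions chosen for illustration; on a real cluster each chunk would be processed on a different node.

```python
def split_into_chunks(items, chunk_size):
    """Step 1: divide the large problem into fixed-size sub-problems."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def sum_of_squares(chunk):
    """Step 2: the same function, applied independently to every chunk."""
    return sum(x * x for x in chunk)


def divide_apply_combine(items, chunk_size):
    """Run all three steps and return the combined result."""
    chunks = split_into_chunks(items, chunk_size)
    partial_results = [sum_of_squares(chunk) for chunk in chunks]
    # Step 3: combine the partial outputs into one answer.
    return sum(partial_results)
```

Because each chunk is processed without reference to the others, step 2 is the part that can be spread across many machines.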
MapReduce (MR)
- Programming framework (library and runtime) for analyzing data sets stored in HDFS
- MapReduce jobs are composed of two functions: Map and Reduce
- User only writes the Map and Reduce functions
- MR framework provides all the glue and coordinates the execution of the Map and Reduce jobs on the cluster
  - Fault tolerant
  - Scalable
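As a rough illustration of that division of labor, the sketch below separates the two user-written functions from a toy, single-process stand-in for the framework. `run_mapreduce` is a hypothetical helper written for this example, not a Hadoop API; it only imitates the grouping ("shuffle") step and offers none of the fault tolerance or distribution of the real runtime.

```python
from collections import defaultdict


def map_words(line):
    """User-written Map: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """User-written Reduce: collapse all values for one key into a total."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy 'framework': map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # shuffle: gather values per key
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))
```

A call such as `run_mapreduce(lines, map_words, reduce_counts)` returns a word-count dictionary. Swapping in different Map and Reduce functions changes the analysis without touching the glue code.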
A Quick Summary: Where to do big data analytics?
Data Model
  Parallel DB Systems: Structured data with known schema
  NoSQL: Any data will fit in any format, (un)(semi)structured
Hardware Configuration
  Parallel DB Systems: Purchased as an appliance
  NoSQL: User assembled from commodity machines
Fault Tolerance
  Parallel DB Systems: Failures assumed to be rare; no query-level fault tolerance
  NoSQL: Failures assumed to be common; simple, yet efficient, fault tolerance
Key-Value Stores (NoSQL OLTP)
- A record may look like: "Book: Of Mice and Men": Author: Steinbeck
- Great for unstructured data centered on a single object
- Typically used as a cache for data frequently requested by web applications such as online shopping carts or social media sites
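The caching use case can be sketched with a dictionary-backed store. `KeyValueStore`, `get_cart` and the `cart:<user_id>` key format are assumptions made up for this example; a production key-value store would add networking, expiry and persistence.

```python
class KeyValueStore:
    """Minimal in-memory key-value store: opaque values looked up by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


def get_cart(cache, user_id, load_from_database):
    """Cache-aside lookup: try the store first, fall back to the slow source.

    load_from_database is a hypothetical callable standing in for a query
    against the system of record.
    """
    key = f"cart:{user_id}"
    cart = cache.get(key)
    if cart is None:
        cart = load_from_database(user_id)
        cache.put(key, cart)  # later requests are served from the cache
    return cart
```

The store never inspects the value, which is why this model suits data centered on a single object: every access is a lookup by one key.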
Document Stores
A record may look like:
  id => 12345,
  name => Jane,
  age => 22,
  address =>
    number => 123
    street => Main
Often deployed for web-traffic analysis, social gaming, content stores, user-behavior/action analysis, or log-file analysis in real time.
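The record above maps naturally onto a nested dictionary. The sketch below shows how a document store can query inside that nesting without a predeclared schema. The `find` and `get_path` helpers, the dotted-path syntax and the second sample document are assumptions for illustration, not the API of any particular product.

```python
def get_path(document, dotted_path):
    """Follow a dotted path such as 'address.street' into nested dicts."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # field is absent in this document
        current = current[part]
    return current


def find(collection, criteria):
    """Return the documents whose fields equal every value in criteria."""
    return [
        doc
        for doc in collection
        if all(get_path(doc, path) == want for path, want in criteria.items())
    ]


people = [
    # The record from the slide.
    {"id": 12345, "name": "Jane", "age": 22,
     "address": {"number": 123, "street": "Main"}},
    # Hypothetical second document with a different shape (no age, extra tags).
    {"id": 12346, "name": "Raj", "address": {"street": "Oak"}, "tags": ["gamer"]},
]
```

`find(people, {"address.street": "Main"})` returns only Jane's document, even though the two documents do not share the same set of fields.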
Graph Stores: Emphasizing Relationships as Primary Data
- Based on graph theory
- Vertices (nodes), edges (relations) and properties
- Navigating social networks, configurations and recommendations
  - e.g., get the cheapest flights from DFW to SYD leaving on 7/12/13 with a minimum number of stops and each stop less than 2 hours
  - e.g., Social Networks Churn and Offer Management
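The flight question is a path search over vertices (airports) and edges (flights) with a price property. The sketch below finds the cheapest route within a stop limit using a priority queue. The fares and intermediate airports are invented for illustration, and the layover-time and departure-date constraints from the slide are left out to keep the example short.

```python
import heapq

# Hypothetical one-way fares: (origin, destination, price).
FLIGHTS = [
    ("DFW", "SYD", 1500),
    ("DFW", "LAX", 200),
    ("LAX", "SYD", 900),
    ("DFW", "HNL", 400),
    ("HNL", "SYD", 600),
    ("DFW", "SFO", 150),
    ("SFO", "HNL", 200),
]


def cheapest_route(flights, src, dst, max_stops):
    """Return (total_price, airports) for the cheapest route, or None."""
    graph = {}
    for origin, dest, price in flights:
        graph.setdefault(origin, []).append((dest, price))

    heap = [(0, [src])]  # candidates ordered by total price so far
    while heap:
        cost, path = heapq.heappop(heap)
        airport = path[-1]
        if airport == dst:
            return cost, path  # first arrival popped is the cheapest
        if len(path) - 1 > max_stops:
            continue  # one more leg would exceed the stop limit
        for nxt, price in graph.get(airport, []):
            if nxt not in path:  # never revisit an airport
                heapq.heappush(heap, (cost + price, path + [nxt]))
    return None
```

With these sample fares, allowing more stops lowers the price: nonstop costs 1500, one stop via HNL costs 1000, and two stops via SFO and HNL cost 950. A graph store answers this kind of traversal natively instead of through repeated self-joins.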
Operational Big Data Platform Selection
[Chart: Key-Value, Document, Column Store, SQL and Graph platforms positioned along two axes, Data Size and Workload Complexity]
The NoSQL Challenge
There's No Technology Silver Bullet
Source: eBay, "eBay Extreme Analytics in a Virtual World", Nov 10, 2010
Enablers for NoSQL
Data Integration
- Increasingly, data first lands in the unstructured universe
- NoSQL stores are big data "EL" tools
- The Need for Data Integration with the Enterprise
Data Virtualization
Enterprise Data Virtualization
- Data Warehouses
- Marts & Cubes
- Operational Data Stores
- Transactional Sources
- File Systems
- Big Data
Enterprise-wide data fabric providing consistent and timely access to all structured and semi-structured data
2011 Composite Software, Inc. / Composite Proprietary
Infrastructure Strategy, including Cloud
The benefits of cloud computing are:
- On-Demand and Self Service
- Broad Network Access
- Resource Pooling
- Rapid Elasticity
- Measured Service
Source: Cloud Security and Privacy: An Enterprise Perspective on Risks and Compliance (Mather, Kumaraswamy & Latif)
What Will Motivate IT to Adopt NoSQL? Continuation of Big Vendor Legacy Seen as Too Expensive Scaling: Data > 1 Machine Schema Flexibility Mandatory Requirements to Keep Multiple Years of Highly Detailed Data Tired of Losing Deals to More Agile Hybrid IT Organizations NoSQL Tool Marketplace Innovations