Big Data and Databases Vijay Gadepally (vijayg@ll.mit.edu) Lauren Milechin (lauren.milechin@ll.mit.edu) This work is sponsored by the Department of the Air Force under Air Force Contract F8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Outline Challenge Overview General Strategies Database Fundamentals and Technologies Up and Coming Technologies
Big Data Challenge
Rapidly increasing:
- Data volume
- Data velocity
- Data variety
- Data veracity
Users (deciders): Kids, Adults, Elderly
Sources (providers): Building Security, Building Environment, Building Usage, Commuter Vehicles, Work Vehicles, Transport Vehicles, Student Smartphones, Classroom Tablets, Fitness Wearables
Figure labels: Things, Gap, Humans; time axis: 10 Years Ago, 5 Years Ago, Today, In 5 Years
Challenge of Data Volume
- Where do I store my data? 1 TB total (Applications & Data), 2 TB total (Data), Scalable Data Center (Data)
- How much do I store?
- How do I access it? Flat file, Spreadsheet, Database, Distributed database
- How do I index it?
Challenge of Data Velocity 2011 Data Generated Per Minute Facebook: 684,478 pieces of content Twitter: 100,000 tweets YouTube: 48 hours of new video Google: 2,000,000 new queries Internet Population: 2.1 Billion people
Challenge of Data Velocity 2014 Data Generated Per Minute Facebook: 2,460,000 pieces of content Twitter: 277,000 tweets YouTube: 72 hours of new video Google: 4,000,000 new queries Internet Population: 2.4 Billion people
Challenge of Data Velocity
Increase in Data Generated, 2011 to 2014:
- Facebook: 350 MB/min
- Twitter: 50 MB/min
- YouTube: 24-48 GB/min
How do I capture my data for processing?
How do I process the data within the specified time constraints?
Challenge of Data Variety
What does the data look like? Tweets, Images, Text and Documents, Audio
How do I index heterogeneous data formats?
- Strings may be easily stored in a database
- Image and document metadata may fit in a traditional database
- Raw images/documents may require a file system or alternate database
Challenge of Data Variety
What does the data look like? Tweets, Images, Text and Documents, Audio
How do I fuse heterogeneous data formats to provide a uniform view? Fusion drives:
- Indexing/schema decisions
- Technology (databases, storage, etc.) selection
- Selection of software (visualization, language) tools
Challenge of Data Variety
What does the data look like? Tweets, Images, Text and Documents, Audio
How do I develop algorithms for heterogeneous data formats?
- Images can use High Performance Computing tools
- Strings and documents require a new algebra to take advantage of High Performance Computing systems
- Visualization requires merging image data with string data
Challenge of Data Veracity Does the data need protection? How do I balance privacy with availability? What level of security is required? How do I make data available only to vetted analysts? How is data kept secure and private while minimizing impact on analysis?
Challenge of Data Veracity Does the data need protection? How confident am I in the integrity of my data? Where did it come from? Who has accessed it? Has anyone modified the data stream? Has anyone tampered with the data stream?
General Strategy: System Design
Users (deciders): Kids, Adults, Elderly
System components: User Interface, Files, Ingest & Enrichment, Ingest, Databases, Analytics (A, B, C, D, E), Scheduler, Computing
Sources (providers): Building Security, Building Environment, Building Usage, Commuter Vehicles, Work Vehicles, Transport Vehicles, Student Smartphones, Classroom Tablets, Fitness Wearables
Figure labels: Things, Gap, Humans; time axis: 10 Years Ago, 5 Years Ago, Today, In 5 Years
General Strategies: Collection
- Collect, Store, and Process only Useful Data
- Intelligently Reduce the Amount of Data through Sampling Techniques
Figure: Degree Distribution, log-log plot of Count against Degree (up to d max), with regions labeled SIGNAL and NOISE in N-D SPACE
Example background model: Power Law Graph
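The two ideas on this slide, measuring a degree distribution and reducing data by sampling, can be sketched in a few lines of Python. This is only an illustration: the function names `degree_distribution` and `sample_edges`, the uniform edge-sampling rule, and the star-graph input are assumptions, not the sampling technique used by the authors.

```python
import random
from collections import Counter

def degree_distribution(edges):
    """Return {degree: number of vertices with that degree} for an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())

def sample_edges(edges, keep_fraction, seed=0):
    """Keep each edge independently with probability keep_fraction (uniform sampling)."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() < keep_fraction]

# Hypothetical input: a star graph with one hub connected to 100 leaves.
edges = [("hub", f"leaf{i}") for i in range(100)]
full = degree_distribution(edges)
sampled = degree_distribution(sample_edges(edges, keep_fraction=0.1))
```

Comparing `full` with `sampled` shows how much of the degree structure survives a given sampling rate, which is the kind of check one would run before deciding what to discard.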
General Strategy: Privacy-Preserving Technology
Humans (deciders): Kids, Adults, Elderly
System components: Web, Raw Data, Ingest & Enrichment, Ingest, Databases, Analytics (A, B, C, D, E), Scheduler, Computing
Threats: Data Integrity Attack, Data Loss / Exfiltration, Insider Threat
Things (providers): Building Security, Building Environment, Building Usage, Commuter Vehicles, Work Vehicles, Transport Vehicles, Student Smartphones, Classroom Tablets, Fitness Wearables
General Strategy: Privacy-Preserving Technology
Use Cryptographic Protocols to Protect the Confidentiality, Integrity, and/or Availability of Data
- Lots of ongoing research
- Popular techniques: Fully Homomorphic Encryption, Multiparty Computation
Computing on Masked Data (CMD)
- Cryptographic protections for the NoSQL Accumulo database
- Uses order-preserving, deterministic, and semantically secure encryption
- 2-4x performance overhead
Figure labels: Plaintext Query, Encrypt, Masked Query, CMD, Big Data Cloud, Masked Analytic Result, Decrypt, Plaintext Analytic Result
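The deterministic-masking idea can be illustrated with a minimal Python sketch: equal plaintexts map to equal masked tokens, so a server can answer exact-match queries without seeing the plaintext. This is not the CMD implementation. It swaps in a keyed HMAC-SHA256 token as a stand-in for deterministic encryption (an HMAC cannot be decrypted), and the key, the `mask` and `masked_lookup` helpers, and the sample names are all hypothetical.

```python
import hmac
import hashlib

def mask(key: bytes, value: str) -> str:
    """Deterministic keyed token: equal plaintexts give equal masked values."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

key = b"hypothetical-client-secret"  # assumption: the key never leaves the client

# Client masks values before upload; the server stores only masked tokens.
server_index = {mask(key, name): row_id
                for row_id, name in enumerate(["alice", "bob", "carol"])}

def masked_lookup(index, masked_query):
    """Server side: exact-match lookup on masked tokens only."""
    return index.get(masked_query)

# Client masks the query term with the same key and sends only the token.
result = masked_lookup(server_index, mask(key, "bob"))
```

Order-preserving and semantically secure schemes listed on the slide trade off differently: the first also supports range queries, the second hides even equality, so this sketch covers only the deterministic case.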
Database Fundamentals
Database: Collection of data and supporting data structures
Database Management System (DBMS): Software that provides an interface between user and database
- Define new data and schema
- Retrieve (Query) data
- DB administration: set security and permissions
Database Fundamentals
ACID
- Atomicity: each transaction either fully succeeds or fails
- Consistency: all nodes see the same valid data all the time
- Isolation: concurrent transactions result in the system state that would be obtained from serial transactions
- Durability: committed transactions remain committed
BASE
- Basically Available, Soft-state services with Eventual consistency
Figure labels: ACID, BASE, BigTable
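Atomicity is the easiest of these guarantees to see in code. The sketch below uses Python's standard `sqlite3` module; the `accounts` table, the `transfer` helper, and the balances are hypothetical and chosen only to show a transaction that commits and one that rolls back.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move amount from src to dst atomically; return True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
        return True
    except ValueError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and commits
failed = transfer(conn, "alice", "bob", 500)  # fails midway and rolls back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The failed transfer has already debited the source account when the check fires, yet the rollback leaves both balances as they were after the first transfer.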
Database Fundamentals
CAP Theorem
Impossible for a distributed system to simultaneously provide all three of:
- Consistency
- Availability
- Partition Tolerance
Figure labels: Consistency, Availability, Partition Tolerance, BigTable
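The trade-off can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not a real replication protocol: the `TwoNodeStore` class, its `mode` flag, and the synchronous replication rule are all hypothetical. During a network partition, a "CP" store refuses writes to stay consistent, while an "AP" store accepts them and lets the replicas diverge.

```python
class Replica:
    def __init__(self):
        self.data = {}

class TwoNodeStore:
    """Toy two-replica store; mode is 'CP' or 'AP'."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, key, value):
        """Client writes through replica a; return True if the write is accepted."""
        if self.partitioned:
            if self.mode == "CP":
                return False          # refuse: stay consistent, give up availability
            self.a.data[key] = value  # accept locally: stay available, replicas diverge
            return True
        self.a.data[key] = value
        self.b.data[key] = value      # replicate synchronously while the link is up
        return True

    def consistent(self):
        return self.a.data == self.b.data

cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
for store in (cp, ap):
    store.write("x", 1)
    store.partitioned = True          # simulate a network partition
cp_accepted = cp.write("x", 2)
ap_accepted = ap.write("x", 2)
```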
Database Fundamentals
Figure: timeline with axis marks 1995, 2004, 2006, 2008, 2010, 2012, 2014, 2016
- DATABASES: SQL, NoSQL, NewSQL; Cluster, Dremel, BigTable
- PARALLEL PROCESSING: MapReduce, Hadoop, Pregel, D4M, Giraph
Slide Source: S. Sawyer, B. D. O'Gwynn, A. Tran, T. Yu. Understanding Query Performance in Accumulo. HPEC 2013.
Database Fundamentals
Figure: Relational DB Systems, NewSQL DB Systems, and NoSQL DB Systems placed on axes of Consistency and Performance
Relational Databases
What It Is
- Database that stores information about data and how it is related
- Highly structured, normalized, table-based database
- Predefined schema/organization of data
- Vertically scalable with good quality hardware
- Uses SQL as query interface
- Typically provides full consistency
Examples
Relational Databases
Who Uses It
When to Use It
- Dealing with transactional data
- Problem sizes are moderate
- Need for ACID guarantees
How to Use It
- JDBC (Java Database Connectivity)
- SQL command line
Relational Databases
Tweet Table
Tweet ID    User ID  Location ID  Tweet Text
096360448   67555    wwz4p7jd     Omg earthquake
544019456   67554    wwh1hss5     We're gonna have an earthquake
600791040   67556    wwwygbvq     Omg it's a earthquake
User Table
User ID  Username    Friends Count
67554    _zariaaa_   541
67555    gnvrly_ron  693
67556    yolvndv     424
Location Table
Location ID  Latitude   Longitude
wwh1hss5     33.951186  -118.328370
wwwygbvq     37.754312  -122.164388
wwz4p7jd     38.337154  -122.670192
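A short sketch shows how the three normalized tables on this slide are recombined with joins. It uses Python's standard `sqlite3` module with an in-memory database; the table and column names (`tweets`, `users`, `locations`, and so on) and the `friends_count > 500` filter are assumptions made for the example. Tweet IDs are stored as text so the leading zero in 096360448 is preserved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT, friends_count INTEGER);
CREATE TABLE locations (location_id TEXT PRIMARY KEY, latitude REAL, longitude REAL);
CREATE TABLE tweets (
    tweet_id    TEXT PRIMARY KEY,
    user_id     INTEGER REFERENCES users(user_id),
    location_id TEXT REFERENCES locations(location_id),
    tweet_text  TEXT
);
""")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (67554, "_zariaaa_", 541), (67555, "gnvrly_ron", 693), (67556, "yolvndv", 424)])
conn.executemany("INSERT INTO locations VALUES (?, ?, ?)", [
    ("wwh1hss5", 33.951186, -118.328370),
    ("wwwygbvq", 37.754312, -122.164388),
    ("wwz4p7jd", 38.337154, -122.670192)])
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", [
    ("096360448", 67555, "wwz4p7jd", "Omg earthquake"),
    ("544019456", 67554, "wwh1hss5", "We're gonna have an earthquake"),
    ("600791040", 67556, "wwwygbvq", "Omg it's a earthquake")])

# Join the three tables: who tweeted what, and from where,
# restricted to users with more than 500 friends.
rows = conn.execute("""
    SELECT u.username, l.latitude, l.longitude, t.tweet_text
    FROM tweets t
    JOIN users u ON u.user_id = t.user_id
    JOIN locations l ON l.location_id = t.location_id
    WHERE u.friends_count > 500
    ORDER BY u.username
""").fetchall()
```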
NoSQL Databases
What It Is
- Database based on documents, key-value pairs, graphs, or wide-column stores
- Dynamic schema
- Horizontal scalability
- Typically provides eventual consistency
Examples
NoSQL Databases
Who Uses It
When to Use It
- Large unstructured datasets
- Strong need for high performance
- Only require BASE guarantees
How to Use It
- Python/Java bindings
- Lincoln Laboratory D4M
- Command line
NoSQL Databases
Edge Table
- Rows (tweet IDs): 096360448, 544019456, 600791040
- Columns: FriendCount 424, FriendCount 541, FriendCount 693, Latitude 33.951186, Latitude 37.754312, Latitude 38.337154, Location wwh1hss5, Location wwwygbvq, Location wwz4p7jd, UserID 67556, UserName _zariaaa_, UserName gnvrly_ron, UserName yolvndv, Word Omg, Word a, Word an, Word earthquake
Transpose Table
- Rows: FriendCount 424, FriendCount 541, Word an, Word earthquake
- Columns (tweet IDs): 096360448, 544019456, 600791040
Degree Table
Column              Degree
FriendCount 424     1
FriendCount 541     1
FriendCount 693     1
Latitude 33.951186  1
Word an             1
Word earthquake     3
Text Table
Tweet ID   Text
096360448  Omg earthquake
544019456  We're gonna have an earthquake
600791040  Omg it's a earthquake
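The four tables on this slide can be derived mechanically from the raw records. The following Python sketch builds them as plain dictionaries. It is an illustration only: the `explode` function, the `|` separator between field name and value, and the subset of fields used are assumptions, and a real deployment would write these tables to a key-value store rather than keep them in memory.

```python
from collections import defaultdict

# Records keyed by tweet ID, using values shown on the slides.
records = {
    "096360448": {"UserID": "67555", "Location": "wwz4p7jd",
                  "Text": "Omg earthquake"},
    "544019456": {"UserID": "67554", "Location": "wwh1hss5",
                  "Text": "We're gonna have an earthquake"},
    "600791040": {"UserID": "67556", "Location": "wwwygbvq",
                  "Text": "Omg it's a earthquake"},
}

def explode(records, sep="|"):
    """Build edge, transpose, degree, and text tables from flat records."""
    edge = defaultdict(dict)       # row key -> {"Field|value": 1}
    transpose = defaultdict(dict)  # "Field|value" -> {row key: 1}
    degree = defaultdict(int)      # "Field|value" -> number of rows containing it
    text = {}                      # row key -> raw text
    for row, fields in records.items():
        text[row] = fields["Text"]
        columns = {f"{name}{sep}{value}"
                   for name, value in fields.items() if name != "Text"}
        columns |= {f"Word{sep}{word}" for word in fields["Text"].split()}
        for col in columns:
            edge[row][col] = 1
            transpose[col][row] = 1
            degree[col] += 1
    return edge, transpose, degree, text

edge, transpose, degree, text = explode(records)
```

With this layout, finding every tweet that contains a word is a single row lookup in the transpose table (for example `transpose["Word|earthquake"]`), and the degree table tells you in advance how many results to expect.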
NoSQL Example - Accumulo
Accumulo Design Drivers
1 Cell-Level Security
- Express common security requirements in the infrastructure, not just in the application
- Data-centric approach encourages secure sharing
2 Scalability
- Near linear performance improvements at thousands of nodes
- Durable and reliable under increased failures that come with scale
3 Diverse, Interactive Analytics
- Sorted key/value core performs well in a diverse set of domains
- Information retrieval, statistics, graph analysis, geo indexing, and more
4 Flexible, Adaptive Schema
- Start with universal structures and indexing
- Refine the schema over time
Source: Sqrrl Data Inc
Accumulo Features
- Visibility Labels
- Iterators
- Automatic table splitting
- Support for Apache Thrift proxy
Figure: matrix of features (Visibility, Iterator, Table-split, Thrift, Schema, D4M) against volume, velocity, variety, veracity
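Cell-level security through visibility labels can be sketched in Python as a filter applied during a scan over sorted key/value entries. This is a toy model, not Accumulo's API: the `visible` and `scan` functions, the sample cells, and the labels are hypothetical, and the evaluator simply applies `&` and `|` left to right, relying on parentheses to group mixed operators.

```python
import re

def visible(expression, authorizations):
    """Evaluate a label expression such as 'admin&(audit|ops)' against a set of authorizations."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    if not tokens:
        return True  # an empty label is treated as visible to everyone
    pos = 0

    def parse_term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_expr()
            pos += 1  # skip the closing ")"
            return value
        return tok in authorizations

    def parse_expr():
        nonlocal pos
        value = parse_term()
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            op = tokens[pos]
            pos += 1
            rhs = parse_term()
            value = (value and rhs) if op == "&" else (value or rhs)
        return value

    return parse_expr()

def scan(cells, authorizations):
    """Return (key, value) pairs in sorted key order, hiding cells the caller may not see."""
    return [(k, v) for k, label, v in sorted(cells) if visible(label, authorizations)]

# Hypothetical cells: (key, visibility label, value).
cells = [
    ("row2", "admin&(audit|ops)", "restricted"),
    ("row1", "", "public"),
    ("row3", "analyst", "analyst-only"),
]
```

For example, `scan(cells, {"analyst"})` returns the public and analyst cells but silently omits the restricted one, which is the behavior that lets differently cleared users share one table.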
NewSQL Databases
What It Is
- Database systems that emulate the performance of NoSQL along with the ACID guarantees of Relational Databases
- Usually a scaled-up version of a relational database
- Often uses an array data model
- Other data models include graph-based data structures and distributed relational tables
- May make use of in-memory processing or specialized hardware
Examples
NewSQL Databases
Who Uses It
When to Use It
- Large multidimensional datasets
- Data that doesn't fit in traditional databases
- Have the volume for NoSQL, but need ACID guarantees
How to Use It
- Each has a custom API
- Ex: SciDB uses JDBC, SHIM, D4M, R-SciDB binding
NewSQL Example: SciDB
SciDB Design Drivers
- Language support: R, Python, Matlab, Julia
- Massively Parallel Processing Database
- Array data model
- Complex analytics
- Commodity clusters or cloud
SciDB Example Schema
- Highly customizable to application
- Each cell is a strongly-typed structure of attributes, such as <int> or <double, string, float>
- Nullable attributes, empty cells, sparse, or dense
Figure: array with dimensions stock (MSFT, MSFVX, MT) and time (12342778213, 12342778214, 12342778215); cell values shown: (price: 15.76, volume: 200), (price: 234.2, volume: 10), (price: 17.50, volume: null), (price: 17.40, volume: 100), (price: 0.02, volume: null)
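The array data model on this slide, with typed cells, nullable attributes, and empty cells, can be mimicked in plain Python. This sketch does not use SciDB or its query language. The `Cell` class and the helper functions are hypothetical, and because the figure's layout was lost, the assignment of values to (stock, time) coordinates below is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Cell:
    price: float
    volume: Optional[int]  # nullable attribute

# Sparse 2-D array: only non-empty cells are stored, keyed by (stock, time).
# The placement of values on coordinates is hypothetical.
array = {
    ("MSFT", 12342778213): Cell(price=15.76, volume=200),
    ("MSFT", 12342778214): Cell(price=17.50, volume=None),
    ("MSFVX", 12342778213): Cell(price=234.2, volume=10),
}

def slice_stock(array, stock):
    """Return (time, cell) pairs for one stock, ordered by time."""
    pairs = [(t, c) for (s, t), c in array.items() if s == stock]
    return sorted(pairs, key=lambda pair: pair[0])

def total_volume(array, stock):
    """Sum volume over a stock's cells, skipping null volumes."""
    return sum(c.volume for _, c in slice_stock(array, stock)
               if c.volume is not None)
```

A cell that is absent from the dictionary is an empty cell, while a present cell with `volume=None` has a null attribute; the two cases are distinct, as they are in the array model.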
SciDB Features
- Massively Parallel Database
- Array Data Model
- Analytic language support
- In-database analytics
Figure: matrix of features (MPP DB, Array, Languages, Analytics) against volume, velocity, variety, veracity
Quick Reference: RDBMS vs. NoSQL vs. NewSQL
Relational Databases
- Examples: MySQL, PostgreSQL, Oracle
- Schema: Typed columns with relational keys
- Architecture: Single-node or sharded
- Guarantees: ACID transactions
- Access: SQL, indexing, joins, and query planning
NoSQL
- Examples: HBase, Cassandra, Accumulo
- Schema: Schema-less
- Architecture: Distributed, scalable
- Guarantees: Eventually consistent
- Access: Low-level API (scans and filtering)
NewSQL
- Examples: SciDB, VoltDB, MemSQL
- Schema: Strongly-typed structure of attributes
- Architecture: Distributed, scalable
- Guarantees: ACID transactions (most)
- Access: Custom API, JDBC, bindings to popular languages
Slide Source: S. Sawyer, B. D. O'Gwynn, A. Tran, T. Yu. Understanding Query Performance in Accumulo. HPEC 2013.
On The Horizon: New Technologies and Techniques
New database and processing technology such as:
- Apache Spark: In-memory distributed processing
- TileDB: Database for scientific big data
- S-Store: Database tuned for streaming data
New cross-database and storage engine standards, APIs, and practices:
- BigDawg: An API to simplify big data analytics, currently being designed
- GraphBLAS: An effort to standardize graph algorithms and databases
Advances in privacy-preserving technology:
- SPED: Signal processing in the encrypted domain
- Greater efficiency of protocols such as Functional Encryption and Multiparty Computation
Tools and technologies will continue to evolve; it is important to keep students abreast of new developments
Conclusions
- Lots of stuff going on!
- Very important to understand the details of your dataset, end analytic, and other requirements
Topics covered:
- Challenge overview (What is the problem?)
- Some general strategies
- Databases
- Upcoming technologies
About MIT
- Leading Science and Engineering Research University
- 80 Nobel laureates, 50 National Medal of Science recipients
- Thousands of companies (11th largest world economy)
- 1000 faculty, 10000 employees, 10000 students
- $1.4B in annual external research funding: Lincoln Laboratory $800M, Other MIT $600M