Large Data Analysis using the Dynamic Distributed Dimensional Data Model (D4M) Dr. Jeremy Kepner Computing & Analytics Group Dr. Christian Anderson Intelligence & Decision Technologies Group This work is sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government. D4M-1
Acknowledgements Benjamin O Gwynn Darrell Ricke Dylan Hutchinson Bill Arcand Bill Bergeron David Bestor Chansup Byun Matt Hubbell Pete Michaleas Julie Mullen Andy Prout Albert Reuther Tony Rosa Charles Yee Andrew Weinert D4M-2
Outline Introduction Capabilities Data Analytics Summary D4M-3
Example Applications of Graph Analytics
- ISR: graphs represent entities and relationships detected through multi-int sources; 1,000s to 1,000,000s of tracks and locations. GOAL: identify anomalous patterns of life.
- Social: graphs represent relationships between individuals or documents; 10,000s to 10,000,000s of individuals and interactions. GOAL: identify hidden social networks.
- Cyber: graphs represent communication patterns of computers on a network; 1,000,000s to 1,000,000,000s of network events. GOAL: detect cyber attacks or malicious software.
Cross-Mission Challenge: detection of subtle patterns in massive, multi-source, noisy datasets D4M-4
Big Data + Big Compute Stack
Novel analytics for: text, cyber, bio; weak signatures, noisy data, dynamics
High Level Composable API: D4M (Databases for Matlab), array algebra
Distributed Database: Accumulo (triple store), distributed database / distributed file system
High Performance Computing: LLGrid + Hadoop, interactive supercomputing
Combining Big Compute and Big Data enables entirely new domains D4M-5
LLGrid Database Architecture (DaaS)
Interactive compute, VM, and database jobs run across service nodes, compute nodes, project data storage, cluster and LAN switches, a scheduler, and system monitoring.
Databases (DBs) give users low-latency atomic access to large quantities of unstructured data with well defined interfaces: a good match for cyber, text, and bio datasets.
The service allows users to launch interactive DBs from their desktop and to share project-specific DBs. D4M-6
IaaS: Infrastructure as a Service; PaaS: Platform as a Service; DaaS: Data as a Service
High Level Language: D4M http://www.mit.edu/~kepner/d4m
D4M (Dynamic Distributed Dimensional Data Model) connects a numerical computing environment to a distributed database through associative arrays. A D4M query returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB. D4M-7
D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization
Outline Introduction Capabilities Data Analytics Summary D4M-8
What is Big Data? Big Tables and Spreadsheets
Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?). Big Tables (Google, Amazon, Facebook, ...) store most of the analyzed data in the world (exabytes?).
Simultaneous diverse data: strings, dates, integers, reals, ...
Simultaneous diverse uses: matrices, functions, hash tables, databases, ...
Yet they have no formal mathematical basis: zero papers in AMS or SIAM. D4M-9
D4M Key Concept: Associative Arrays Unify Four Abstractions
Extends associative arrays to 2D and mixed data types:
A('alice ','bob ') = 'cited ' or A('alice ','bob ') = 47.0
Key innovation: the 2D form is 1-to-1 with the triple store:
('alice ','bob ','cited ') or ('alice ','bob ',47.0) D4M-10
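The 1-to-1 correspondence above can be sketched in a few lines. This is an illustrative Python model, not the D4M API: a 2D associative array is represented as a mapping from (row, column) pairs to values, which flattens losslessly into triples and back.

```python
# Hypothetical sketch (not D4M itself): a 2D associative array
# {(row, col): value} is 1-to-1 with a set of (row, col, value) triples.
def to_triples(assoc):
    """Flatten the 2D associative array into sorted triples."""
    return sorted((r, c, v) for (r, c), v in assoc.items())

def from_triples(triples):
    """Rebuild the 2D associative array from its triples."""
    return {(r, c): v for r, c, v in triples}

A = {('alice', 'bob'): 'cited', ('carl', 'bob'): 47.0}
triples = to_triples(A)
assert triples[0] == ('alice', 'bob', 'cited')
assert from_triples(triples) == A   # round trip: the mapping is 1-to-1
```

Because the two forms carry identical information, the same array can live in memory as a sparse matrix or in a triple-store database without translation loss.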
Composable Associative Arrays
Key innovation: mathematical closure. All associative array operations return associative arrays, which enables composable mathematical operations:
A + B, A - B, A & B, A | B, A*B
and composable query operations via array indexing:
A('alice bob ',:), A('alice ',:), A('al* ',:), A('alice : bob ',:), A(1:2,:), A == 47.0
Simple to implement in a library (~2000 lines) in programming environments with first-class support of 2D arrays, operator overloading, and sparse linear algebra. Complex queries take ~50x less effort than Java/SQL, and the approach naturally leads to a high performance parallel implementation. D4M-11
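The closure property can be demonstrated with a toy class. This Python sketch is an assumption-laden stand-in for D4M's MATLAB implementation: every operation, including comparison with a scalar, returns another associative array, so expressions chain.

```python
# Illustrative sketch (assumed class, not the D4M library): closure means
# every operation on an associative array yields an associative array.
class Assoc:
    def __init__(self, data):
        self.data = dict(data)          # {(row, col): numeric value}

    def __add__(self, other):           # element-wise sum over the union of keys
        keys = set(self.data) | set(other.data)
        return Assoc({k: self.data.get(k, 0) + other.data.get(k, 0)
                      for k in keys})

    def __eq__(self, scalar):           # A == 47.0 yields a sub-array, not a bool
        return Assoc({k: v for k, v in self.data.items() if v == scalar})

    def rows(self, *names):             # analogue of A('alice bob ', :)
        return Assoc({k: v for k, v in self.data.items() if k[0] in names})

A = Assoc({('alice', 'bob'): 46.0})
B = Assoc({('alice', 'bob'): 1.0, ('carl', 'bob'): 2.0})
C = (A + B) == 47.0                     # operations compose because of closure
assert C.data == {('alice', 'bob'): 47.0}
```

Overloading `==` to return a filtered array rather than a boolean mirrors how D4M query expressions stay inside the algebra, which is what makes complex queries so compact.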
Leveraging Big Data Technologies for High Speed Sequence Matching
[Figure: run time (seconds) vs. code volume (lines); D4M + Triple Store is ~100x faster than BLAST with ~100x less code]
A high performance triple store database trades computations for lookups. The Apache Accumulo database accelerated the comparison by 100x, and Lincoln D4M software reduced code size by 100x. D4M-12
Accumulo Ingestion Scalability Study: LLGrid MapReduce with a Python Application
Accumulo database: 1 master + 7 tablet servers, sustaining 4 million entries/second
Data #1: 5 GB in 200 files; Data #2: 30 GB in 1000 files D4M-13
References Book: Graph Algorithms in the Language of Linear Algebra. Editors: Kepner (MIT-LL) and Gilbert (UCSB). Contributors: Bader (Ga Tech), Bliss (MIT-LL), Bond (MIT-LL), Dunlavy (Sandia), Faloutsos (CMU), Fineman (CMU), Gilbert (UCSB), Heitsch (Ga Tech), Hendrickson (Sandia), Kegelmeyer (Sandia), Kepner (MIT-LL), Kolda (Sandia), Leskovec (CMU), Madduri (Ga Tech), Mohindra (MIT-LL), Nguyen (MIT), Radar (MIT-LL), Reinhardt (Microsoft), Robinson (MIT-LL), Shah (UCSB) D4M-14
Outline Introduction Capabilities Data Analytics Summary D4M-15
Tweets2011 Corpus http://trec.nist.gov/data/tweets/
Assembled for the Text REtrieval Conference (TREC 2011)*; designed to be a reusable, representative sample of the twittersphere, spanning many languages.
16,141,812 tweets sampled during 2011-01-23 to 2011-02-08 (16,951 from before)
11,595,844 undeleted tweets at time of scrape (2012-02-14)
161,735,518 distinct data entries
5,356,842 unique users; 3,513,897 unique handles (@); 519,617 unique hashtags (#)
Ben Jabur et al., ACM SAC 2012; *McCreadie et al., "On building a reusable Twitter corpus," ACM SIGIR 2012 D4M-16
D4M Pipeline: 1. Parse (raw data files) -> 2. Ingest (triple files, Assoc files) -> 3a. Query / 3b. Scan (Accumulo database) -> 4. Analyze (Assoc arrays)
Tweets2011 (single-node database):
Raw data: 1.8 GB (1621 files; 100K tweets/file)
1. Parse: 20 minutes (Np = 16), producing 6.7 GB of triple files and Assoc files (7x1621 files)
2. Ingest: 40 minutes (Np = 3); 20 entries/tweet at 200K entries/second, yielding an Accumulo database of 14 GB in 350M entries
Raw data to fully indexed database in under 1 hour. The database is good for queries; Assoc files are good for complete scans. D4M-17
Twitter Input Data (TweetID | User | Status | Time | Text):
29002227913850880 | Michislipstick | 200 | Sun Jan 23 02:27:24 +0000 2011 | @mi_pegadejeito Tipo. Você...
29002228131954688 | rosana | 200 | Sun Jan 23 02:27:24 +0000 2011 | para la semana q termino...
29002228165509120 | doasabo | 200 | Sun Jan 23 02:27:24 +0000 2011 | お腹すいたずえ
29002228937265152 | agusscastillo | 200 | Sun Jan 23 02:27:24 +0000 2011 | A nadie le va a importar...
29002229444771841 | nob_sin | 200 | Sun Jan 23 02:27:24 +0000 2011 | さて札幌に帰るか
29002230724038657 | bimosephano | 200 | Sun Jan 23 02:27:25 +0000 2011 | Wait :)
29002231177019392 | _Word_Play | 200 | Sun Jan 23 02:27:25 +0000 2011 | Shawty is 53% and he pick...
29002231202193408 | missogeeeeb | 200 | Sun Jan 23 02:27:25 +0000 2011 | Lazy sunday ( ) oooo!
29002231692922880 | PennyCheco06 | 301 | null | null
A mixture of structured fields (TweetID, User, Status, Time) and unstructured text; fits well into the standard D4M exploded schema. D4M-18
Generic D4M Triple Store Exploded Schema
Input data (Time | Col1 | Col2 | Col3):
2001-01-01 | a | | a
2001-01-02 | b | b |
2001-01-03 | | c | c
Accumulo table T (rows are times, columns are column/value pairs):
01-01-2001: Col1 a = 1, Col3 a = 1
02-01-2001: Col1 b = 1, Col2 b = 1
03-01-2001: Col2 c = 1, Col3 c = 1
Accumulo table Ttranspose holds the same entries with rows and columns swapped.
Tabular data is expanded to create many type/value columns; transpose pairs allow quick lookup of either row or column; big-endian time improves parallel performance. D4M-19
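The exploded-schema transform can be sketched directly. The delimiter and column naming below are assumptions for illustration, not the exact D4M/Accumulo format: each (column, value) pair becomes its own column key with value 1, and a transpose copy is kept so either dimension can be scanned quickly.

```python
# Sketch of the exploded schema (assumed '|' delimiter, not the real
# D4M format): each non-empty (column, value) pair becomes its own
# column key, and a transpose pair is stored for fast row OR column lookup.
def explode(row_key, record):
    """Turn one tabular record into (row, 'col|value', 1) entries."""
    return [(row_key, f'{col}|{val}', 1)
            for col, val in record.items() if val]

def transpose(entries):
    """Swap rows and columns to build the companion transpose table."""
    return [(c, r, v) for r, c, v in entries]

entries = explode('2001-01-01', {'Col1': 'a', 'Col2': '', 'Col3': 'a'})
assert entries == [('2001-01-01', 'Col1|a', 1), ('2001-01-01', 'Col3|a', 1)]
assert transpose(entries)[0] == ('Col1|a', '2001-01-01', 1)
```

Storing the table and its transpose doubles the ingest volume but makes both "which columns does this row have" and "which rows have this column" into fast range scans.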
Tweets2011 D4M Schema: Accumulo Tables Tedge/TedgeT
Row keys are the reversed TweetIDs (e.g. 08805831972220092); column keys index every unique string in the data set.
TedgeDeg: a degree table that accumulates the sum of every unique string.
TedgeTxt: stores the original text for viewing purposes, e.g.
08805831972220092: @mi_pegadejeito Tipo. Você fazer uma plaquinha pra mim, com o nome do FC pra você tirar uma foto, pode fazer isso?
75683042703220092: Wait :)
08822929613220092: null
The standard exploded schema indexes every unique string in the data set. D4M-20
Outline Introduction Capabilities Data Analytics Summary D4M-21
Analytic Use Cases
[Figure: data request size vs. total data volume]
- Serial program (serial memory)
- Serial or parallel program + files (parallel memory / serial storage)
- Serial or parallel program + database
- Parallel program + parallel files
- Parallel program + parallel database (parallel storage)
Data volume and data request size determine the best approach. Always start with the simplest and move to the most complex. D4M-22
Word Tallies
D4M Code:
Tdeg = DB('TedgeDeg');
str2num(Tdeg(StartsWith('word : '),:)) > 20000
D4M Result (1.2 seconds, Np = 1):
word : 77946, word :( 80637, word :) 263222, word :D 151191, word :P 34340, word :p 56696
The sum table TedgeDeg allows tallies to be seen instantly. D4M-23
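The tally above reduces to a prefix scan plus a threshold filter over the degree table. A toy Python version (the degree values below are invented, not the real Tweets2011 counts):

```python
# Toy sum-table tally (counts and '|' delimiter are illustrative
# assumptions): the degree table maps each unique string to its total
# count, so a tally is just a prefix scan plus threshold filter.
degree = {'word|:)': 263222, 'word|:D': 151191,
          'word|:P': 34340, 'word|lol': 9000}

def tally(deg, prefix, threshold):
    """Return entries under `prefix` whose count exceeds `threshold`."""
    return {k: v for k, v in deg.items()
            if k.startswith(prefix) and v > threshold}

assert tally(degree, 'word|:', 200000) == {'word|:)': 263222}
```

Because the sums are precomputed at ingest, the query touches only a handful of rows instead of rescanning 350M entries.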
Simple Join
D4M code to check size:
Tdeg = DB('TedgeDeg');
Tdeg('word couch word potato ',:)
D4M Result (0.02 seconds, Np = 1): word couch 888, word potato 442
D4M code to perform the join:
T = DB('Tedge','TedgeT');
Ttxt = DB('TedgeTxt');
Ttxt(Row(sum(dblLogi(T(:,'word couch word potato ')),2)==2),:)
D4M Result (0.14 seconds, Np = 1):
12551320743844323 Being an ardent potato couch myself, I have a more than familiar afinity with the wisdom behind "let...
14093676140884513 After PLKN, I'm getting 'The Hills Season 1' box set DVD. Need to be a couch potato for a while!
20276279685394823 @ItalianLoveee well it's no fun staying in the house being a couch potato
43806763553841692 how did I go from a weekend of sleeping/being a couch potato to a suddenly sleep-deprived #Monday?
61421826999744053 Sibs the couch potato RT @SibsMacd: ok.done with #generations now @Intersexions
61634735485149023 I'm about too cash in all these couch potato chips.
67333549982173643 Today was a couch potato day. The best show was "Top Chefs Just Desserts". Wanna make some...
75263579957836813 @thekatiestevens being the best couch potato ever!!! Charging fir the week ahead!!
88814535275189333 Pre Bday Function tonight inside #INDUSTRY don't be a couch potato
90855806092940792 PLL and GG haha I am officially a couch potato
Using the sum table TedgeDeg allows the scale of a query to be computed prior to execution of the full query. D4M-24
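The join logic is worth spelling out: rows containing every requested word are exactly those where the sum of the word-indicator columns equals the number of words. A minimal Python sketch with invented data:

```python
# Sketch of the indicator-sum join (data invented): a document matches
# when its sum of per-word indicators equals the number of query words.
tweets = {
    1: {'being', 'couch', 'potato'},
    2: {'couch', 'dvd'},
    3: {'couch', 'chips', 'potato'},
}

def join(docs, words):
    """Return sorted doc ids containing all of `words`."""
    return sorted(doc for doc, terms in docs.items()
                  if sum(w in terms for w in words) == len(words))

assert join(tweets, ['couch', 'potato']) == [1, 3]
```

This is the same trick the D4M line uses: `dblLogi` converts the selected columns to 0/1 indicators, `sum(...,2)==2` keeps rows hitting both words, and `Row(...)` extracts their TweetIDs.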
Most Popular Hashtags (#)
D4M Database Code:
Tdeg = DB('TedgeDeg');
An = Assoc('','','');
Ti = Iterator(Tdeg,'elements',10000);
A = Ti(StartsWith('word # '),:);
while nnz(A)
  An = An + (str2num(A) > 100);
  A = Ti();
end
D4M Result (25 seconds, Np = 1):
word #Egypt 14781, word #FF 17508, word #Jan25 10349, word #jan25 11237, word #nowplaying 13334, word #np 11692
Iterators allow scanning of data in optimally sized blocks. D4M-25
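The iterator pattern above, accumulating a result while pulling the table in fixed-size blocks, can be sketched as follows (block size and entries are illustrative assumptions):

```python
# Chunked-scan sketch: a generator yields fixed-size blocks of table
# entries, the way a D4M iterator limits how many entries each call
# returns, and the caller accumulates a threshold tally across blocks.
def scan_in_blocks(entries, block_size):
    for i in range(0, len(entries), block_size):
        yield entries[i:i + block_size]

entries = [('word|#a', 150), ('word|#b', 90),
           ('word|#c', 300), ('word|#d', 101)]
popular = {}
for block in scan_in_blocks(entries, block_size=2):
    popular.update((k, v) for k, v in block if v > 100)

assert popular == {'word|#a': 150, 'word|#c': 300, 'word|#d': 101}
```

Bounding the block size keeps client memory flat no matter how large the scanned range is, which is why the same loop works for the 520K-hashtag scan here and the larger handle scan on the next slide.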
Most Popular Users (@) by Reference
D4M Database Code:
Tdeg = DB('TedgeDeg');
An = Assoc('','','');
Ti = Iterator(Tdeg,'elements',10000);
A = Ti(StartsWith('word @ '),:);
while nnz(A)
  An = An + (str2num(A) > 100);
  A = Ti();
end
D4M Result (130 seconds, Np = 1):
word @ 32368, word @AddThis 7457, word @MentionKe: 9640, word @SoalCINTA: 8481, word @foursquare! 4620, word @justinbieber 15581, word @youtube 10206
Iterators allow scanning of data in optimally sized blocks. D4M-26
Users Who ReTweet the Most: Problem Size
D4M code to check the size of the status codes:
Tdeg = DB('TedgeDeg');
Tdeg(StartsWith('stat '),:)
D4M Results (0.02 seconds, Np = 1):
stat 200 10864273 (OK), stat 301 2861507 (moved permanently), stat 302 836327 (retweet), stat 403 825822 (protected), stat 404 753882 (deleted tweet), stat 408 1 (request timeout)
The sum table TedgeDeg indicates 836K retweets (~5% of total): small enough to hold all TweetIDs in memory, on the boundary between a database query and a complete file scan. D4M-27
Users Who ReTweet the Most: Parallel Database Query
D4M parallel database code:
T = DB('Tedge','TedgeT');
Ar = T(:,'stat 302 ');
my = global_ind(zeros(size(Ar,1),1,map([Np 1],{},0:Np-1)));
An = Assoc('','','');
N = 10000;
for i=my(1):N:my(end)
  Ai = dblLogi(T(Row(Ar(i:min(i+N,my(end)),:)),:));
  An = An + sum(Ai(:,StartsWith('user,')),1);
end
An = gagg(An > 2);
D4M Result (130 seconds, Np = 8):
user Puque007 103, user Say 113, user carp_fans 115, user habu_bot 111, user kakusan_rt 135, user umaitofu 116
Each processor queries all the retweet TweetIDs and picks a subset; the processors each sum all users in their tweets and then aggregate. D4M-28
Users Who ReTweet the Most: Parallel File Scan
D4M parallel file scan code:
Nfile = size(filelist);
my = global_ind(zeros(Nfile,1,map([Np 1],{},0:Np-1)));
An = Assoc('','','');
for i=my
  load(filelist{i});
  An = An + sum(A(Row(A(:,'stat 302,')),StartsWith('user,')),1);
end
An = gagg(An > 2);
D4M Result (150 seconds, Np = 16):
user Puque007 100, user Say 113, user carp_fans 114, user habu_bot 109, user kakusan_rt 135, user umaitofu 114
Each processor picks a subset of files and scans them for retweets; the processors each sum all users in their tweets and then aggregate. D4M-29
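Both parallel versions rest on the same block-distribution idea behind `global_ind` and `map`: each of Np processors owns a contiguous slice of the index range, scans only its slice, and partial sums are aggregated at the end. A Python sketch of that distribution (a stand-in, not D4M's actual map construct):

```python
# Sketch of contiguous block distribution, a stand-in for D4M's
# global_ind/map: each rank owns a contiguous slice of n_items, and
# together the ranks cover every item exactly once.
def my_indices(n_items, np_procs, rank):
    """Indices owned by `rank` when n_items are split np_procs ways."""
    base, extra = divmod(n_items, np_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return list(range(start, stop))

# 10 files over 4 processors: every file is owned exactly once.
owned = [my_indices(10, 4, r) for r in range(4)]
assert sum(owned, []) == list(range(10))
assert owned[0] == [0, 1, 2]
```

After each rank sums the users in its slice, a global aggregation (the role `gagg` plays above) combines the per-rank associative arrays into the final tally.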
Summary
Big data is found across a wide range of areas: document analysis, computer network analysis, and DNA sequencing.
Currently there is a gap in big data analysis tools for algorithm developers.
D4M fills this gap by providing algorithm developers composable associative arrays that admit linear algebraic manipulation. D4M-30