Large Data Analysis using the Dynamic Distributed Dimensional Data Model (D4M)


1 Large Data Analysis using the Dynamic Distributed Dimensional Data Model (D4M) Dr. Jeremy Kepner, Computing & Analytics Group; Dr. Christian Anderson, Intelligence & Decision Technologies Group. This work is sponsored by the Department of the Air Force under Air Force Contract #FA C. Opinions, interpretations, recommendations, and conclusions are those of the authors and are not necessarily endorsed by the United States Government. D4M-1

2 Acknowledgements Benjamin O Gwynn Darrell Ricke Dylan Hutchinson Bill Arcand Bill Bergeron David Bestor Chansup Byun Matt Hubbell Pete Michaleas Julie Mullen Andy Prout Albert Reuther Tony Rosa Charles Yee Andrew Weinert D4M-2

3 Outline Introduction Capabilities Data Analytics Summary D4M-3

4 Example Applications of Graph Analytics ISR: graphs represent entities and relationships detected through multi-INT sources; 1,000s to 1,000,000s of tracks and locations; GOAL: identify anomalous patterns of life. Social: graphs represent relationships between individuals or documents; 10,000s to 10,000,000s of individuals and interactions; GOAL: identify hidden social networks. Cyber: graphs represent communication patterns of computers on a network; 1,000,000s to 1,000,000,000s of network events; GOAL: detect cyber attacks or malicious software. Cross-Mission Challenge: detection of subtle patterns in massive multi-source noisy datasets D4M-4

5 Big Data + Big Compute Stack Novel Analytics for: Text, Cyber, Bio; Weak Signatures, Noisy Data, Dynamics. High Level Composable API: D4M ("Databases for Matlab"), Array Algebra. Distributed Database: Accumulo (triple store), Distributed Database / Distributed File System. High Performance Computing: LLGrid + Hadoop, Interactive Supercomputing. Combining Big Compute and Big Data enables entirely new domains D4M-5

6 LLGrid Database Architecture (DaaS) [Figure: interactive compute, VM, and database jobs run on service and compute nodes, with project data on network storage behind the cluster and LAN switches, plus a scheduler and monitoring system] Databases (DBs) give users low latency atomic access to large quantities of unstructured data with well defined interfaces. Good match for cyber, text, and bio datasets. The service allows users to launch interactive DBs from their desktop and share project-specific DBs. IaaS: Infrastructure as a Service; PaaS: Platform as a Service; DaaS: Data as a Service. D4M-6

7 High Level Language: D4M D4M = Dynamic Distributed Dimensional Data Model: associative arrays + a numerical computing environment + a distributed database. A D4M query (e.g., over rows Alice, Bob, Cathy, David, Earl) returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB. D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization. D4M-7
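
As a minimal sketch of what such a query looks like in MATLAB, the lines below use the table bindings and query style that appear later in this talk; DB is assumed to be an existing D4M database binding, and the 'word|couch' query string is just an illustrative choice.

    T    = DB('Tedge','TedgeT');   % edge table plus its transpose pair
    Ttxt = DB('TedgeTxt');         % original tweet text
    A    = T(:,'word|couch ');     % query returns a sparse associative array
    txt  = Ttxt(Row(A),:);         % pull back the raw text of the matching tweets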

8 Outline Introduction Capabilities Data Analytics Summary D4M-8

9 What is Big Data? Big Tables + Spreadsheets. Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?). Big Tables (Google, Amazon, Facebook, ...) store most of the analyzed data in the world (exabytes?). Simultaneous diverse data: strings, dates, integers, reals, ... Simultaneous diverse uses: matrices, functions, hash tables, databases, ... No formal mathematical basis; zero papers in AMS or SIAM. D4M-9

10 D4M Key Concept: Associative Arrays Unify Four Abstractions Extends associative arrays to 2D and mixed data types: A('alice ','bob ') = 'cited ' or A('alice ','bob ') = 47.0. Key innovation: 2D is 1-to-1 with the triple store: ('alice ','bob ','cited ') or ('alice ','bob ',47.0). [Figure: the alice-cited-bob relationship shown both as a graph edge and as an adjacency matrix entry, with A^T x illustrating traversal from alice through bob to carl as a matrix-vector product] D4M-10

11 Composable Associative Arrays Key innovation: mathematical closure. All associative array operations return associative arrays, which enables composable mathematical operations: A + B, A - B, A & B, A | B, A*B. It also enables composable query operations via array indexing: A('alice bob ',:), A('alice ',:), A('al* ',:), A('alice : bob ',:), A(1:2,:), A == 47.0. Simple to implement in a library (~2000 lines) in programming environments with first-class support of 2D arrays, operator overloading, and sparse linear algebra. Complex queries take ~50x less effort than Java/SQL. Naturally leads to a high performance parallel implementation. D4M-11
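
A minimal sketch of this composability, using only constructs that appear on these slides (Assoc, delimiter-terminated string lists, dblLogi, array indexing); the particular row, column, and value strings are made-up examples, and the transpose/multiply step is an illustrative assumption about how the overloaded operators combine.

    % Build a small associative array from delimiter-terminated triples
    % (the trailing space is the list delimiter, as in A('alice bob ',:)).
    A = Assoc('alice alice bob ','bob carl carl ','cited cited cited ');

    % Composable queries: each expression returns another associative array.
    A1 = A('alice ',:);         % exact row key
    A2 = A('al* ',:);           % wildcard row match
    A3 = A('alice : bob ',:);   % row range

    % Composable algebra: convert string values to numeric 0/1, then combine.
    An = dblLogi(A);
    C  = An * An';              % row keys that share at least one column key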

12 Leveraging Big Data Technologies for High Speed Sequence Matching [Figure: run time (seconds) vs. code volume (lines) for BLAST, D4M, and D4M + triple store; the D4M + triple store implementation is ~100x faster with ~100x smaller code volume] A high performance triple store database trades computations for lookups. Used the Apache Accumulo database to accelerate the comparison by 100x. Used Lincoln D4M software to reduce code size by 100x. D4M-12

13 Accumulo Ingestion Scalability Study LLGrid MapReduce with a Python application. Accumulo database: 1 master + 7 tablet servers; ~4 million entries/second. Data set #1: 5 GB in 200 files. Data set #2: 30 GB in 1000 files. D4M-13

14 References Book: Graph Algorithms in the Language of Linear Algebra Editors: Kepner (MIT-LL) and Gilbert (UCSB) Contributors: Bader (Ga Tech), Bliss (MIT-LL), Bond (MIT-LL), Dunlavy (Sandia), Faloutsos (CMU), Fineman (CMU), Gilbert (UCSB), Heitsch (Ga Tech), Hendrickson (Sandia), Kegelmeyer (Sandia), Kepner (MIT-LL), Kolda (Sandia), Leskovec (CMU), Madduri (Ga Tech), Mohindra (MIT-LL), Nguyen (MIT), Radar (MIT-LL), Reinhardt (Microsoft), Robinson (MIT-LL), Shah (UCSB) D4M-14

15 Outline Introduction Capabilities Data Analytics Summary D4M-15

16 Tweets2011 Corpus Assembled for the Text REtrieval Conference (TREC 2011)* Designed to be a reusable, representative sample of the twittersphere. Many languages. 16,141,812 tweets sampled over the collection period (16,951 from before it). 11,595,844 undeleted tweets at time of scrape. 161,735,518 distinct data entries. 5,356,842 unique users. 3,513,897 unique handles (@). 519,617 unique hashtags (#). Ben Jabur et al, ACM SAC 2012. *McCreadie et al, On building a reusable Twitter corpus, ACM SIGIR 2012. D4M-16

17 D4M Pipeline: Raw Data Files -> 1. Parse -> Triple Files / Assoc Files -> 2. Ingest -> Accumulo Database -> 3a. Query -> Assoc Arrays -> 4. Analyze (3b. Scan reads the Assoc files directly into Assoc Arrays)
Tweets2011 (single node database)
Raw Data: 1.8 GB (1621 files; 100K tweets/file)
1. Parse: 20 minutes (Np = 16)
Triple Files & Assoc Files: 6.7 GB (7 x 1621 files)
2. Ingest: 40 minutes (Np = 3); 20 entries/tweet, 200K entries/second
Accumulo Database: 14 GB in 350M entries
Raw data to fully indexed database in < 1 hour
Database good for queries; Assoc files good for complete scans
D4M-17
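
A rough sketch of one pass through steps 1 and 2 of this pipeline for a single raw file; parseTweetFile and the file names are hypothetical placeholders, while Assoc, put, and the Tedge/TedgeT binding mirror the schema and code on the surrounding slides.

    % Step 1 (Parse): turn one raw tweet file into delimiter-terminated triple lists.
    % parseTweetFile is a hypothetical helper, not part of D4M.
    [r, c, v] = parseTweetFile('tweets_0001.txt');   % row = TweetID, col = 'type|value'
    A = Assoc(r, c, v);
    save('tweets_0001.A.mat','A');                   % Assoc file, good for complete scans

    % Step 2 (Ingest): insert the same triples into the indexed Accumulo tables.
    T = DB('Tedge','TedgeT');                        % table plus its transpose pair
    put(T, A);                                       % database, good for queries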

18 Twitter Input Data
Columns: TweetID, User, Status, Time, Text. Sample rows (TweetID values omitted):
Michislipstick | 200 | Sun Jan 23 02:27 | Tipo. Você
rosana | 200 | Sun Jan 23 02:27 | para la semana q termino
doasabo | 200 | Sun Jan 23 02:27 | お腹すいたずえ
agusscastillo | 200 | Sun Jan 23 02:27 | A nadie le va a importar
nob_sin | 200 | Sun Jan 23 02:27 | さて札幌に帰るか
bimosephano | 200 | Sun Jan 23 02:27 | Wait :)
_Word_Play | 200 | Sun Jan 23 02:27 | Shawty is 53% and he pick
missogeeeeb | 200 | Sun Jan 23 02:27 | Lazy sunday ( ) oooo!
PennyCheco | null | null |
Mixture of structured (TweetID, User, Status, Time) and unstructured (Text) columns; fits well into the standard D4M Exploded Schema
D4M-18

19 Generic D4M Triple Store Exploded Schema
[Figure: an input table keyed by Time with columns Col1, Col2, Col3 and values a, b, c is exploded into Accumulo table T and its transpose Ttranspose, whose column keys are type|value pairs such as Col1|a, Col1|b, Col2|b, Col2|c, Col3|a, Col3|c, each with value 1]
Tabular data is expanded to create many type|value columns
Transpose pairs allow quick lookup of either rows or columns
Big-endian time for parallel performance
D4M-19
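
To make the exploded schema concrete, this sketch forms the associative array for the single input record in the figure (columns Col1..Col3 with values a, b, c); the row-key time string and delimiter choices are assumptions for illustration.

    % One input record, keyed by (big-endian) time, exploded into type|value columns
    % (space is the list delimiter).
    rowKey  = '20110123022700 ';                 % assumed time-string format
    colKeys = 'Col1|a Col2|b Col3|c ';
    E  = Assoc(repmat(rowKey,1,3), colKeys, '1 1 1 ');
    Et = E';   % transpose pair: supports fast lookup by column key as well as row key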

20 Tweets2011 D4M Schema
Accumulo tables: Tedge / TedgeT (row key = TweetID, column key = exploded type|value string), TedgeDeg (row key = string, column = Degree, value = count), TedgeTxt (row key = TweetID, column = text, value = the original tweet, e.g. 'Tipo. Você fazer uma plaquinha pra mim, com o nome do FC pra você tirar uma foto, pode fazer isso?', 'Wait :)', null)
Standard exploded schema indexes every unique string in the data set
TedgeDeg accumulates sums of every unique string
TedgeTxt stores original text for viewing purposes
D4M-20
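
As a sketch of how one tweet could populate these three tables: the table names and the Tedge/TedgeDeg/TedgeTxt roles come from this slide, while the TweetID, the parsed strings, and the assumption that put plus server-side summing handles the TedgeDeg accumulation are illustrative only.

    tweetID = '28121231231231231 ';                          % hypothetical TweetID
    cols    = 'user|bimosephano stat|200 word|Wait word|:) ';
    E = Assoc(repmat(tweetID,1,4), cols, '1 1 1 1 ');
    put(DB('Tedge','TedgeT'), E);                            % indexed edges + transpose

    % TedgeDeg: one count per unique string, in a 'Degree' column; running totals
    % are assumed to be accumulated on the server side.
    put(DB('TedgeDeg'), Assoc(cols, repmat('Degree ',1,4), '1 1 1 1 '));

    % TedgeTxt: the raw text, stored for viewing (comma delimiter here, since the
    % text itself contains spaces).
    put(DB('TedgeTxt'), Assoc('28121231231231231,','text,','Wait :),'));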

21 Outline Introduction Capabilities Data Analytics Summary D4M-21

22 Analytic Use Cases
[Figure: data request size vs. total data volume, with regions for: Serial Program (serial memory); Serial or Parallel Program + Files; Serial or Parallel Program + Database (parallel memory / serial storage); Parallel Program + Parallel Files; Parallel Program + Parallel Database (parallel storage)]
Data volume and data request size determine the best approach
Always want to start with the simplest and move to the most complex
D4M-22

23 Word Tallies
D4M Code
    Tdeg = DB('TedgeDeg');
    str2num(Tdeg(StartsWith('word|: '),:))
D4M Result (1.2 seconds, Np = 1)
    word|: , word|:( , word|:) , word|:D , word|:P , word|:p
Sum table TedgeDeg allows tallies to be seen instantly
D4M-23

24 Simple Join
D4M Code to check size
    Tdeg = DB('TedgeDeg');
    Tdeg('word|couch word|potato ',:)
D4M Result (0.02 seconds, Np = 1)
    word|couch 888, word|potato 442
D4M Code to perform join
    T = DB('Tedge','TedgeT');
    Ttxt = DB('TedgeTxt');
    Ttxt(Row(sum(dblLogi(T(:,'word|couch word|potato ')),2) == 2),:)
D4M Result (0.14 seconds, Np = 1)
    Being an ardent potato couch myself, I have a more than familiar afinity with the wisdom behind "let
    After PLKN, I'm getting 'The Hills Season 1' box set DVD. Need to be a couch potato for a while!
    well it's no fun staying in the house being a couch potato
    how did I go from a weekend of sleeping/being a couch potato to a suddenly sleep-deprived #Monday?
    Sibs the couch potato ok.done with #generations
    I'm about too cash in all these couch potato chips
    Today was a couch potato day. The best show was "Top Chefs Just Desserts". Wanna make some...
    being the best couch potato ever!!! Charging fir the week ahead!! Pre Bday Function tonight inside #INDUSTRY
    don't be a couch potato PLL and GG haha
    I am officially a couch potato
Using the sum table TedgeDeg allows the scale of a query to be computed prior to executing the full query
D4M-24

25 Most Popular Hashtags (#)
D4M Database Code
    Tdeg = DB('TedgeDeg');
    An = Assoc('','','');
    Ti = Iterator(Tdeg,'elements',10000);
    A = Ti(StartsWith('word|# '),:);
    while nnz(A)
      An = An + (str2num(A) > 100);
      A = Ti();
    end
D4M Result (25 seconds, Np = 1)
    word|#Egypt 14781, word|#FF 17508, word|#Jan , word|#jan , word|#nowplaying 13334, word|#np
Iterators allow scanning of data in optimally sized blocks
D4M-25

26 Most Popular Users by Reference
D4M Database Code
    Tdeg = DB('TedgeDeg');
    An = Assoc('','','');
    Ti = Iterator(Tdeg,'elements',10000);
    A = Ti(StartsWith('word|@ '),:);
    while nnz(A)
      An = An + (str2num(A) > 100);
      A = Ti();
    end
D4M Result (130 seconds, Np = 1)
    reference counts for the most-mentioned @handles: 32368, 7457, 9640, 8481, 4620, 15581
Iterators allow scanning of data in optimally sized blocks
D4M-26

27 Users Who ReTweet the Most - Problem Size
D4M Code to check size of status codes
    Tdeg = DB('TedgeDeg');
    str2num(Tdeg(StartsWith('stat| '),:))
D4M Result (0.02 seconds, Np = 1)
    stat OK, stat Moved permanently, stat ReTweet, stat Protected, stat Deleted tweet, stat Request timeout
Sum table TedgeDeg indicates 836K retweets (~5% of total)
Small enough to hold all TweetIDs in memory
On the boundary between a database query and a complete file scan
D4M-27

28 Users Who ReTweet the Most - Parallel Database Query
D4M Parallel Database Code
    T = DB('Tedge','TedgeT');
    Ar = T(:,'stat|302 ');
    my = global_ind(zeros(size(Ar,1),1,map([Np 1],{},0:Np-1)));
    An = Assoc('','','');
    N = 10000;
    for i = my(1):N:my(end)
      Ai = dblLogi(T(Row(Ar(i:min(i+N,my(end)),:)),:));
      An = An + sum(Ai(:,StartsWith('user|,')),1);
    end
    Asum = gagg(An > 2);
D4M Result (130 seconds, Np = 8)
    user|Puque , user|Say 113, user|carp_fans 115, user|habu_bot 111, user|kakusan_rt 135, user|umaitofu 116
Each processor queries all the retweet TweetIDs and picks a subset
Processors each sum all users in their tweets and then aggregate
D4M-28

29 Users Who ReTweet the Most - Parallel File Scan
D4M Parallel File Scan Code
    Nfile = numel(filelist);
    my = global_ind(zeros(Nfile,1,map([Np 1],{},0:Np-1)));
    An = Assoc('','','');
    for i = my
      load(filelist{i});
      An = An + sum(A(Row(A(:,'stat|302,')),StartsWith('user|,')),1);
    end
    An = gagg(An > 2);
D4M Result (150 seconds, Np = 16)
    user|Puque , user|Say 113, user|carp_fans 114, user|habu_bot 109, user|kakusan_rt 135, user|umaitofu 114
Each processor picks a subset of files and scans them for retweets
Processors each sum all users in their tweets and then aggregate
D4M-29

30 Summary Big data is found across a wide range of areas: document analysis, computer network analysis, DNA sequencing. Currently there is a gap in big data analysis tools for algorithm developers. D4M fills this gap by providing algorithm developers with composable associative arrays that admit linear algebraic manipulation. D4M-30
