Large Data Analysis using the Dynamic Distributed Dimensional Data Model (D4M)


1 Large Data Analysis using the Dynamic Distributed Dimensional Data Model (D4M) Dr. Jeremy Kepner, Computing & Analytics Group; Dr. Christian Anderson, Intelligence & Decision Technologies Group. This work is sponsored by the Department of the Air Force under Air Force Contract #FA C. Opinions, interpretations, recommendations, and conclusions are those of the authors and are not necessarily endorsed by the United States Government. D4M-1

2 Acknowledgements Benjamin O Gwynn Darrell Ricke Dylan Hutchinson Bill Arcand Bill Bergeron David Bestor Chansup Byun Matt Hubbell Pete Michaleas Julie Mullen Andy Prout Albert Reuther Tony Rosa Charles Yee Andrew Weinert D4M-2

3 Outline Introduction Capabilities Data Analytics Summary D4M-3

4 Example Applications of Graph Analytics ISR: graphs represent entities and relationships detected through multi-INT sources; 1,000s to 1,000,000s of tracks and locations; GOAL: identify anomalous patterns of life. Social: graphs represent relationships between individuals or documents; 10,000s to 10,000,000s of individuals and interactions; GOAL: identify hidden social networks. Cyber: graphs represent communication patterns of computers on a network; 1,000,000s to 1,000,000,000s of network events; GOAL: detect cyber attacks or malicious software. Cross-Mission Challenge: detection of subtle patterns in massive multi-source noisy datasets D4M-4

5 Big Data + Big Compute Stack Novel Analytics for: Text, Cyber, Bio; Weak Signatures, Noisy Data, Dynamics. High Level Composable API: D4M ("Databases for Matlab"), Array Algebra. Distributed Database: Accumulo (triple store), Distributed Database / Distributed File System. High Performance Computing: LLGrid + Hadoop, Interactive Supercomputing. Combining Big Compute and Big Data enables entirely new domains D4M-5

6 LLGrid Database Architecture (DaaS) [Figure: interactive compute, VM, and database jobs run on service and compute nodes, with project data on network storage behind the cluster and LAN switches, plus a scheduler and monitoring system] Databases (DBs) give users low latency atomic access to large quantities of unstructured data with well defined interfaces. Good match for cyber, text, and bio datasets. The service allows users to launch interactive DBs from their desktop and share project-specific DBs. IaaS: Infrastructure as a Service; PaaS: Platform as a Service; DaaS: Data as a Service. D4M-6

7 High Level Language: D4M D4M = Dynamic Distributed Dimensional Data Model: associative arrays + a numerical computing environment + a distributed database. A D4M query (e.g., over rows Alice, Bob, Cathy, David, Earl) returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB. D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization. D4M-7
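
As a minimal sketch of what such a query looks like in MATLAB, the lines below use the table bindings and query style that appear later in this talk; DB is assumed to be an existing D4M database binding, and the 'word|couch' query string is just an illustrative choice.

    T    = DB('Tedge','TedgeT');   % edge table plus its transpose pair
    Ttxt = DB('TedgeTxt');         % original tweet text
    A    = T(:,'word|couch ');     % query returns a sparse associative array
    txt  = Ttxt(Row(A),:);         % pull back the raw text of the matching tweets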

8 Outline Introduction Capabilities Data Analytics Summary D4M-8

9 What is Big Data? Big Tables + Spreadsheets. Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?). Big Tables (Google, Amazon, Facebook, ...) store most of the analyzed data in the world (exabytes?). Simultaneous diverse data: strings, dates, integers, reals, ... Simultaneous diverse uses: matrices, functions, hash tables, databases, ... No formal mathematical basis; zero papers in AMS or SIAM. D4M-9

10 D4M Key Concept: Associative Arrays Unify Four Abstractions Extends associative arrays to 2D and mixed data types: A('alice ','bob ') = 'cited ' or A('alice ','bob ') = 47.0. Key innovation: 2D is 1-to-1 with the triple store: ('alice ','bob ','cited ') or ('alice ','bob ',47.0). [Figure: the alice-cited-bob relationship shown both as a graph edge and as an adjacency matrix entry, with A^T x illustrating traversal from alice through bob to carl as a matrix-vector product] D4M-10

11 Composable Associative Arrays Key innovation: mathematical closure. All associative array operations return associative arrays, which enables composable mathematical operations: A + B, A - B, A & B, A | B, A*B. It also enables composable query operations via array indexing: A('alice bob ',:), A('alice ',:), A('al* ',:), A('alice : bob ',:), A(1:2,:), A == 47.0. Simple to implement in a library (~2000 lines) in programming environments with first-class support of 2D arrays, operator overloading, and sparse linear algebra. Complex queries take ~50x less effort than Java/SQL. Naturally leads to a high performance parallel implementation. D4M-11
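
A minimal sketch of this composability, using only constructs that appear on these slides (Assoc, delimiter-terminated string lists, dblLogi, array indexing); the particular row, column, and value strings are made-up examples, and the transpose/multiply step is an illustrative assumption about how the overloaded operators combine.

    % Build a small associative array from delimiter-terminated triples
    % (the trailing space is the list delimiter, as in A('alice bob ',:)).
    A = Assoc('alice alice bob ','bob carl carl ','cited cited cited ');

    % Composable queries: each expression returns another associative array.
    A1 = A('alice ',:);         % exact row key
    A2 = A('al* ',:);           % wildcard row match
    A3 = A('alice : bob ',:);   % row range

    % Composable algebra: convert string values to numeric 0/1, then combine.
    An = dblLogi(A);
    C  = An * An';              % row keys that share at least one column key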

12 Leveraging Big Data Technologies for High Speed Sequence Matching [Figure: run time (seconds) vs. code volume (lines) for BLAST, D4M, and D4M + triple store; the D4M + triple store implementation is ~100x faster with ~100x smaller code volume] A high performance triple store database trades computations for lookups. Used the Apache Accumulo database to accelerate the comparison by 100x. Used Lincoln D4M software to reduce code size by 100x. D4M-12

13 Accumulo Ingestion Scalability Study LLGrid MapReduce with a Python application. Accumulo database: 1 master + 7 tablet servers; ~4 million entries/second. Data set #1: 5 GB in 200 files. Data set #2: 30 GB in 1000 files. D4M-13

14 References Book: Graph Algorithms in the Language of Linear Algebra Editors: Kepner (MIT-LL) and Gilbert (UCSB) Contributors: Bader (Ga Tech), Bliss (MIT-LL), Bond (MIT-LL), Dunlavy (Sandia), Faloutsos (CMU), Fineman (CMU), Gilbert (UCSB), Heitsch (Ga Tech), Hendrickson (Sandia), Kegelmeyer (Sandia), Kepner (MIT-LL), Kolda (Sandia), Leskovec (CMU), Madduri (Ga Tech), Mohindra (MIT-LL), Nguyen (MIT), Radar (MIT-LL), Reinhardt (Microsoft), Robinson (MIT-LL), Shah (UCSB) D4M-14

15 Outline Introduction Capabilities Data Analytics Summary D4M-15

16 Tweets2011 Corpus Assembled for the Text REtrieval Conference (TREC 2011)* Designed to be a reusable, representative sample of the twittersphere. Many languages. 16,141,812 tweets sampled over the collection period (16,951 from before it). 11,595,844 undeleted tweets at time of scrape. 161,735,518 distinct data entries. 5,356,842 unique users. 3,513,897 unique handles (@). 519,617 unique hashtags (#). Ben Jabur et al, ACM SAC 2012. *McCreadie et al, On building a reusable Twitter corpus, ACM SIGIR 2012. D4M-16

17 D4M Pipeline: Raw Data Files -> 1. Parse -> Triple Files / Assoc Files -> 2. Ingest -> Accumulo Database -> 3a. Query -> Assoc Arrays -> 4. Analyze (3b. Scan reads the Assoc files directly into Assoc Arrays)
Tweets2011 (single node database)
Raw Data: 1.8 GB (1621 files; 100K tweets/file)
1. Parse: 20 minutes (Np = 16)
Triple Files & Assoc Files: 6.7 GB (7 x 1621 files)
2. Ingest: 40 minutes (Np = 3); 20 entries/tweet, 200K entries/second
Accumulo Database: 14 GB in 350M entries
Raw data to fully indexed database in < 1 hour
Database good for queries; Assoc files good for complete scans
D4M-17
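
A rough sketch of one pass through steps 1 and 2 of this pipeline for a single raw file; parseTweetFile and the file names are hypothetical placeholders, while Assoc, put, and the Tedge/TedgeT binding mirror the schema and code on the surrounding slides.

    % Step 1 (Parse): turn one raw tweet file into delimiter-terminated triple lists.
    % parseTweetFile is a hypothetical helper, not part of D4M.
    [r, c, v] = parseTweetFile('tweets_0001.txt');   % row = TweetID, col = 'type|value'
    A = Assoc(r, c, v);
    save('tweets_0001.A.mat','A');                   % Assoc file, good for complete scans

    % Step 2 (Ingest): insert the same triples into the indexed Accumulo tables.
    T = DB('Tedge','TedgeT');                        % table plus its transpose pair
    put(T, A);                                       % database, good for queries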

18 Twitter Input Data
Columns: TweetID, User, Status, Time, Text. Sample rows (TweetID values omitted):
Michislipstick | 200 | Sun Jan 23 02:27 | Tipo. Você
rosana | 200 | Sun Jan 23 02:27 | para la semana q termino
doasabo | 200 | Sun Jan 23 02:27 | お腹すいたずえ
agusscastillo | 200 | Sun Jan 23 02:27 | A nadie le va a importar
nob_sin | 200 | Sun Jan 23 02:27 | さて札幌に帰るか
bimosephano | 200 | Sun Jan 23 02:27 | Wait :)
_Word_Play | 200 | Sun Jan 23 02:27 | Shawty is 53% and he pick
missogeeeeb | 200 | Sun Jan 23 02:27 | Lazy sunday ( ) oooo!
PennyCheco | null | null |
Mixture of structured (TweetID, User, Status, Time) and unstructured (Text) columns; fits well into the standard D4M Exploded Schema
D4M-18

19 Generic D4M Triple Store Exploded Schema
[Figure: an input table keyed by Time with columns Col1, Col2, Col3 and values a, b, c is exploded into Accumulo table T and its transpose Ttranspose, whose column keys are type|value pairs such as Col1|a, Col1|b, Col2|b, Col2|c, Col3|a, Col3|c, each with value 1]
Tabular data is expanded to create many type|value columns
Transpose pairs allow quick lookup of either rows or columns
Big-endian time for parallel performance
D4M-19
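
To make the exploded schema concrete, this sketch forms the associative array for the single input record in the figure (columns Col1..Col3 with values a, b, c); the row-key time string and delimiter choices are assumptions for illustration.

    % One input record, keyed by (big-endian) time, exploded into type|value columns
    % (space is the list delimiter).
    rowKey  = '20110123022700 ';                 % assumed time-string format
    colKeys = 'Col1|a Col2|b Col3|c ';
    E  = Assoc(repmat(rowKey,1,3), colKeys, '1 1 1 ');
    Et = E';   % transpose pair: supports fast lookup by column key as well as row key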

20 Tweets2011 D4M Schema
Accumulo tables: Tedge / TedgeT (row key = TweetID, column key = exploded type|value string), TedgeDeg (row key = string, column = Degree, value = count), TedgeTxt (row key = TweetID, column = text, value = the original tweet, e.g. 'Tipo. Você fazer uma plaquinha pra mim, com o nome do FC pra você tirar uma foto, pode fazer isso?', 'Wait :)', null)
Standard exploded schema indexes every unique string in the data set
TedgeDeg accumulates sums of every unique string
TedgeTxt stores original text for viewing purposes
D4M-20
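
As a sketch of how one tweet could populate these three tables: the table names and the Tedge/TedgeDeg/TedgeTxt roles come from this slide, while the TweetID, the parsed strings, and the assumption that put plus server-side summing handles the TedgeDeg accumulation are illustrative only.

    tweetID = '28121231231231231 ';                          % hypothetical TweetID
    cols    = 'user|bimosephano stat|200 word|Wait word|:) ';
    E = Assoc(repmat(tweetID,1,4), cols, '1 1 1 1 ');
    put(DB('Tedge','TedgeT'), E);                            % indexed edges + transpose

    % TedgeDeg: one count per unique string, in a 'Degree' column; running totals
    % are assumed to be accumulated on the server side.
    put(DB('TedgeDeg'), Assoc(cols, repmat('Degree ',1,4), '1 1 1 1 '));

    % TedgeTxt: the raw text, stored for viewing (comma delimiter here, since the
    % text itself contains spaces).
    put(DB('TedgeTxt'), Assoc('28121231231231231,','text,','Wait :),'));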

21 Outline Introduction Capabilities Data Analytics Summary D4M-21

22 Analytic Use Cases
[Figure: data request size vs. total data volume, with regions for: Serial Program (serial memory); Serial or Parallel Program + Files; Serial or Parallel Program + Database (parallel memory / serial storage); Parallel Program + Parallel Files; Parallel Program + Parallel Database (parallel storage)]
Data volume and data request size determine the best approach
Always want to start with the simplest and move to the most complex
D4M-22

23 Word Tallies
D4M Code
    Tdeg = DB('TedgeDeg');
    str2num(Tdeg(StartsWith('word|: '),:))
D4M Result (1.2 seconds, Np = 1)
    word|: , word|:( , word|:) , word|:D , word|:P , word|:p
Sum table TedgeDeg allows tallies to be seen instantly
D4M-23

24 Simple Join
D4M Code to check size
    Tdeg = DB('TedgeDeg');
    Tdeg('word|couch word|potato ',:)
D4M Result (0.02 seconds, Np = 1)
    word|couch 888, word|potato 442
D4M Code to perform join
    T = DB('Tedge','TedgeT');
    Ttxt = DB('TedgeTxt');
    Ttxt(Row(sum(dblLogi(T(:,'word|couch word|potato ')),2) == 2),:)
D4M Result (0.14 seconds, Np = 1)
    Being an ardent potato couch myself, I have a more than familiar afinity with the wisdom behind "let
    After PLKN, I'm getting 'The Hills Season 1' box set DVD. Need to be a couch potato for a while!
    well it's no fun staying in the house being a couch potato
    how did I go from a weekend of sleeping/being a couch potato to a suddenly sleep-deprived #Monday?
    Sibs the couch potato ok.done with #generations
    I'm about too cash in all these couch potato chips
    Today was a couch potato day. The best show was "Top Chefs Just Desserts". Wanna make some...
    being the best couch potato ever!!! Charging fir the week ahead!! Pre Bday Function tonight inside #INDUSTRY
    don't be a couch potato PLL and GG haha
    I am officially a couch potato
Using the sum table TedgeDeg allows the scale of a query to be computed prior to executing the full query
D4M-24

25 Most Popular Hashtags (#)
D4M Database Code
    Tdeg = DB('TedgeDeg');
    An = Assoc('','','');
    Ti = Iterator(Tdeg,'elements',10000);
    A = Ti(StartsWith('word|# '),:);
    while nnz(A)
      An = An + (str2num(A) > 100);
      A = Ti();
    end
D4M Result (25 seconds, Np = 1)
    word|#Egypt 14781, word|#FF 17508, word|#Jan , word|#jan , word|#nowplaying 13334, word|#np
Iterators allow scanning of data in optimally sized blocks
D4M-25

26 Most Popular Users by Reference
D4M Database Code
    Tdeg = DB('TedgeDeg');
    An = Assoc('','','');
    Ti = Iterator(Tdeg,'elements',10000);
    A = Ti(StartsWith('word|@ '),:);
    while nnz(A)
      An = An + (str2num(A) > 100);
      A = Ti();
    end
D4M Result (130 seconds, Np = 1)
    reference counts for the most-mentioned @handles: 32368, 7457, 9640, 8481, 4620, 15581
Iterators allow scanning of data in optimally sized blocks
D4M-26

27 Users Who ReTweet the Most - Problem Size
D4M Code to check size of status codes
    Tdeg = DB('TedgeDeg');
    str2num(Tdeg(StartsWith('stat| '),:))
D4M Result (0.02 seconds, Np = 1)
    stat OK, stat Moved permanently, stat ReTweet, stat Protected, stat Deleted tweet, stat Request timeout
Sum table TedgeDeg indicates 836K retweets (~5% of total)
Small enough to hold all TweetIDs in memory
On the boundary between a database query and a complete file scan
D4M-27

28 Users Who ReTweet the Most - Parallel Database Query
D4M Parallel Database Code
    T = DB('Tedge','TedgeT');
    Ar = T(:,'stat|302 ');
    my = global_ind(zeros(size(Ar,1),1,map([Np 1],{},0:Np-1)));
    An = Assoc('','','');
    N = 10000;
    for i = my(1):N:my(end)
      Ai = dblLogi(T(Row(Ar(i:min(i+N,my(end)),:)),:));
      An = An + sum(Ai(:,StartsWith('user|,')),1);
    end
    Asum = gagg(An > 2);
D4M Result (130 seconds, Np = 8)
    user|Puque , user|Say 113, user|carp_fans 115, user|habu_bot 111, user|kakusan_rt 135, user|umaitofu 116
Each processor queries all the retweet TweetIDs and picks a subset
Processors each sum all users in their tweets and then aggregate
D4M-28

29 Users Who ReTweet the Most - Parallel File Scan
D4M Parallel File Scan Code
    Nfile = numel(filelist);
    my = global_ind(zeros(Nfile,1,map([Np 1],{},0:Np-1)));
    An = Assoc('','','');
    for i = my
      load(filelist{i});
      An = An + sum(A(Row(A(:,'stat|302,')),StartsWith('user|,')),1);
    end
    An = gagg(An > 2);
D4M Result (150 seconds, Np = 16)
    user|Puque , user|Say 113, user|carp_fans 114, user|habu_bot 109, user|kakusan_rt 135, user|umaitofu 114
Each processor picks a subset of files and scans them for retweets
Processors each sum all users in their tweets and then aggregate
D4M-29

30 Summary Big data is found across a wide range of areas: document analysis, computer network analysis, DNA sequencing. Currently there is a gap in big data analysis tools for algorithm developers. D4M fills this gap by providing algorithm developers with composable associative arrays that admit linear algebraic manipulation. D4M-30
