Large Data Analysis using the Dynamic Distributed Dimensional Data Model (D4M)
1 Large Data Analysis using the Dynamic Distributed Dimensional Data Model (D4M) Dr. Jeremy Kepner, Computing & Analytics Group; Dr. Christian Anderson, Intelligence & Decision Technologies Group. This work is sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.
2 Acknowledgements Benjamin O'Gwynn, Darrell Ricke, Dylan Hutchison, Bill Arcand, Bill Bergeron, David Bestor, Chansup Byun, Matt Hubbell, Pete Michaleas, Julie Mullen, Andy Prout, Albert Reuther, Tony Rosa, Charles Yee, Andrew Weinert
3 Outline Introduction Capabilities Data Analytics Summary
4 Example Applications of Graph Analytics ISR / Social / Cyber. ISR: graphs represent entities and relationships detected through multi-INT sources, 1,000s to 1,000,000s of tracks and locations; GOAL: identify anomalous patterns of life. Social: graphs represent relationships between individuals or documents, 10,000s to 10,000,000s of individuals and interactions; GOAL: identify hidden social networks. Cyber: graphs represent communication patterns of computers on a network, 1,000,000s to 1,000,000,000s of network events; GOAL: detect cyber attacks or malicious software. Cross-Mission Challenge: detection of subtle patterns in massive, multi-source, noisy datasets.
5 Big Data + Big Compute Stack Novel analytics for: text, cyber, bio; weak signatures, noisy data, dynamics. High-level composable API: D4M (Databases for Matlab), array algebra. Distributed database: Accumulo (triple store), distributed database / distributed file system. High performance computing: LLGrid + Hadoop, interactive supercomputing. Combining Big Compute and Big Data enables entirely new domains.
6 LLGrid Database Architecture (DaaS) Interactive compute jobs, interactive VM jobs, and interactive database jobs run across service nodes, compute nodes, project data, network storage, a scheduler, and a monitoring system, connected by cluster and LAN switches. Databases (DBs) give users low-latency atomic access to large quantities of unstructured data with well defined interfaces. Good match for cyber, text, and bio datasets. The service allows users to launch interactive DBs from their desktop and share project-specific DBs. IaaS: Infrastructure as a Service; PaaS: Platform as a Service; DaaS: Data as a Service.
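A minimal sketch of binding to such a hosted database from D4M (all connection parameters here are placeholders, not actual LLGrid settings):

DB = DBserver('dbhost:2181','Accumulo','instance','user','pass');  % bind to a running Accumulo instance
T = DB('MyProjectTable');                  % bind (creating if needed) a project-specific table
put(T, Assoc('row1,','col1,','val1,'));    % low-latency atomic insert
A = T('row1,',:);                          % query back as an associative array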
7 High Level Language: D4M Dynamic Distributed Dimensional Data Model: associative arrays bridge a distributed database and a numerical computing environment. A D4M query (e.g., for the rows Alice, Bob, Cathy, David, Earl) returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB. D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization.
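For example, a query result can feed MATLAB's sparse linear algebra directly; a sketch using a hypothetical citation table and the names from the slide:

T = DB('Tcite');                          % hypothetical table of who-cites-whom
A = T('alice,bob,cathy,david,earl,',:);   % query returns a sparse associative array
Ad = dblLogi(A);                          % 0/1 adjacency structure
deg = sum(Ad,2);                          % out-degree of each queried entity
Acc = Ad * Ad.';                          % co-citation counts via sparse matrix multiply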
8 Outline Introduction Capabilities Data Analytics Summary
9 What is Big Data? Big Tables and Spreadsheets. Spreadsheets are the most commonly used analytical structure on Earth (100M users/day?). Big Tables (Google, Amazon, Facebook, ...) store most of the analyzed data in the world (exabytes?). Simultaneous diverse data: strings, dates, integers, reals, ... Simultaneous diverse uses: matrices, functions, hash tables, databases, ... No formal mathematical basis; zero papers in AMS or SIAM.
10 D4M Key Concept: Associative Arrays Unify Four Abstractions Extends associative arrays to 2D and mixed data types: A('alice','bob') = 'cited' or A('alice','bob') = 47.0. Key innovation: the 2D form is 1-to-1 with the triple store: ('alice','bob','cited') or ('alice','bob',47.0).
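In D4M code this correspondence looks roughly as follows (a minimal sketch using the slide's keys):

A = Assoc('alice,','bob,','cited,');   % A('alice','bob') = 'cited'
B = Assoc('alice,','bob,',47.0);       % numeric values work the same way
[r,c,v] = find(A);                     % recovers the triple ('alice,','bob,','cited,')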
11 Composable Associative Arrays Key innovation: mathematical closure. All associative array operations return associative arrays, enabling composable mathematical operations: A + B, A - B, A & B, A | B, A*B. Enables composable query operations via array indexing: A('alice bob ',:), A('alice ',:), A('al* ',:), A('alice : bob ',:), A(1:2,:), A == 47.0. Simple to implement in a library (~2000 lines) in programming environments with 1st-class support of 2D arrays, operator overloading, and sparse linear algebra. Complex queries with ~50x less effort than Java/SQL. Naturally leads to a high performance parallel implementation.
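A short sketch of closure in practice (toy values; the query forms are those listed on the slide):

A = Assoc('alice,alice,bob,','bob,carl,carl,',[47 11 2]);
B = Assoc('alice,','bob,',1);
C1 = A + B;              % element-wise sum; the result is again an associative array
C2 = A & B;              % key intersection
C3 = A * A.';            % associative array multiply
Q1 = A('alice bob ',:);  % multi-row query
Q2 = A('al* ',:);        % wildcard row query
Q3 = (A == 47);          % value selection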
12 Leveraging Big Data Technologies for High Speed Sequence Matching [Chart: run time (seconds) vs. code volume (lines) for BLAST, D4M, and D4M + Triple Store: 100x faster, 100x smaller.] A high performance triple store database trades computations for lookups. Used the Apache Accumulo database to accelerate comparison by 100x. Used Lincoln D4M software to reduce code size by 100x.
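The underlying pattern (described in the related D4M genetic sequence matching work) can be sketched as follows: represent each sequence by its constituent 10-mers in a 0/1 associative array, and one sparse matrix multiply tallies shared 10-mers between every reference/sample pair. The sequence names and 10-mers below are toy data:

A1 = Assoc('ref1,ref1,ref2,','aaatgcaggt,ccagttagca,aaatgcaggt,',1);  % reference x 10-mer
A2 = Assoc('samp1,samp1,','aaatgcaggt,ggttacttta,',1);                % sample x 10-mer
Amatch = A1 * A2.';      % (i,j) = number of shared 10-mers; lookups replace alignment computation
[r,c,v] = find(Amatch);  % candidate sequence matches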
13 Accumulo Ingestion Scalability Study LLGrid MapReduce with a Python application. Accumulo database: 1 master + 7 tablet servers; peak ingest rate 4 million entries/second. Data #1: 5 GB in 200 files. Data #2: 30 GB in 1000 files.
14 References Book: Graph Algorithms in the Language of Linear Algebra. Editors: Kepner (MIT-LL) and Gilbert (UCSB). Contributors: Bader (Ga Tech), Bliss (MIT-LL), Bond (MIT-LL), Dunlavy (Sandia), Faloutsos (CMU), Fineman (CMU), Gilbert (UCSB), Heitsch (Ga Tech), Hendrickson (Sandia), Kegelmeyer (Sandia), Kepner (MIT-LL), Kolda (Sandia), Leskovec (CMU), Madduri (Ga Tech), Mohindra (MIT-LL), Nguyen (MIT), Rader (MIT-LL), Reinhardt (Microsoft), Robinson (MIT-LL), Shah (UCSB).
15 Outline Introduction Capabilities Data Analytics Summary
16 Tweets2011 Corpus Assembled for the Text REtrieval Conference (TREC 2011)*. Designed to be a reusable, representative sample of the twittersphere; many languages. 16,141,812 tweets sampled over the collection period (plus 16,951 from before it); 11,595,844 undeleted tweets at time of scrape; 161,735,518 distinct data entries; 5,356,842 unique users; 3,513,897 unique handles (@); 519,617 unique hashtags (#). Ben Jabur et al., ACM SAC 2012. *McCreadie et al., "On building a reusable Twitter corpus," ACM SIGIR 2012.
17 D4M Pipeline 1. Parse -> 2. Ingest -> 3a. Query / 3b. Scan -> 4. Analyze: raw data files -> triple files / assoc files -> Accumulo database -> associative arrays. Tweets2011 (single node database). Raw data: 1.8 GB (1621 files; 100K tweets/file). 1. Parse: 20 minutes (Np = 16). Triple files & assoc files: 6.7 GB (7x1621 files). 2. Ingest: 40 minutes (Np = 3); 20 entries/tweet, 200K entries/second. Accumulo database: 14 GB in 350M entries. Raw data to a fully indexed database in under 1 hour. The database is good for queries; assoc files are good for complete scans.
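Step 1 of the pipeline might look like the following sketch for a single tweet (the row key and parsed fields are hypothetical, but the column-key style matches the schema on the next slides):

rowID = 'tweet1,';                                            % hypothetical tweet row key
cols = 'user|Michislipstick,stat|200,word|lazy,word|sunday,'; % exploded type|value columns
A = Assoc(rowID, cols, '1,');
save('tweets_0001_A.mat','A');     % assoc file, later used for complete scans
put(DB('Tedge','TedgeT'), A);      % ingest the same triples into the indexed tables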
18 Twitter Input Data Columns: TweetID, User, Status, Time, Text. Sample rows (all Sun Jan 23 02:27): Michislipstick, 200, "Tipo. Você"; rosana, 200, "para la semana q termino"; doasabo, 200, "お腹すいたずえ"; agusscastillo, 200, "A nadie le va a importar"; nob_sin, 200, "さて札幌に帰るか"; bimosephano, 200, "Wait :)"; _Word_Play, 200, "Shawty is 53% and he pick"; missogeeeeb, 200, "Lazy sunday ( ) oooo!"; PennyCheco, null, null. A mixture of structured (TweetID, User, Status, Time) and unstructured (Text) data; fits well into the standard D4M exploded schema.
19 Generic D4M Triple Store Exploded Schema Input data: rows keyed by time, with columns Col1, Col2, Col3 holding values such as a, b, c. Each type/value pair becomes its own column key (Col1|a, Col1|b, Col2|b, Col2|c, Col3|a, Col3|c) with value 1 in Accumulo table T, and the same entries are stored row/column-swapped in table Ttranspose. Tabular data is expanded to create many type/value columns. Transpose pairs allow quick lookup of either row or column. Big-endian time keys give parallel performance.
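A sketch of the explosion step for this toy input, using D4M's CatStr to join type and value:

row = '1,1,2,2,3,3,';                                                % big-endian time keys
cols = CatStr('Col1,Col3,Col1,Col2,Col2,Col3,','|','a,a,b,b,c,c,');  % 'Col1|a,Col3|a,...'
E = Assoc(row, cols, 1);        % one entry per type/value observation
T = DB('T','Ttranspose');       % table/transpose pair
put(T, num2str(E));             % D4M writes both T and Ttranspose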
20 Tweets2011 D4M Schema Accumulo tables Tedge/TedgeT: the standard exploded schema indexes every unique string in the data set under column keys such as user|..., stat|..., and word|.... TedgeDeg accumulates the sum (degree) of every unique string. TedgeTxt stores the original text for viewing purposes (e.g., "Tipo. Você fazer uma plaquinha pra mim, com o nome do FC pra você tirar uma foto, pode fazer isso?", "Wait :)", null).
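One way the Degree sums can be maintained at ingest time; a sketch assuming TedgeDeg is configured with an Accumulo summing combiner, and that D4M's putCol/num2str conversions apply as in its standard examples:

Tdeg = DB('TedgeDeg');
Adeg = putCol(sum(A,1).','Degree,');  % per-string counts for this block of tweets
put(Tdeg, num2str(Adeg));             % the combiner accumulates totals across inserts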
21 Outline Introduction Capabilities Data Analytics Summary
22 Analytic Use Cases [Chart: data request size vs. total data volume.] Regions, from simplest to most complex: serial program (serial memory); serial or parallel program + files; parallel program + parallel files; serial or parallel program + database (parallel memory / serial storage); parallel program + parallel database (parallel storage). Data volume and data request size determine the best approach. Always start with the simplest and move to the most complex.
23 Word Tallies D4M code:
Tdeg = DB('TedgeDeg');
str2num(Tdeg(StartsWith('word|: '),:))
D4M result (1.2 seconds, Np = 1): word|:, word|:(, word|:), word|:D, word|:P, word|:p. The sum table TedgeDeg allows tallies to be seen instantly.
24 Simple Join D4M code to check size:
Tdeg = DB('TedgeDeg');
Tdeg('word|couch word|potato ',:)
D4M result (0.02 seconds, Np = 1): word|couch 888, word|potato 442.
D4M code to perform the join:
T = DB('Tedge','TedgeT');
Ttxt = DB('TedgeTxt');
Ttxt(Row(sum(dblLogi(T(:,'word|couch word|potato ')),2) == 2),:)
D4M result (0.14 seconds, Np = 1): Being an ardent potato couch myself, I have a more than familiar afinity with the wisdom behind "let; After PLKN, I'm getting 'The Hills Season 1' box set DVD. Need to be a couch potato for a while!; well it's no fun staying in the house being a couch potato; how did I go from a weekend of sleeping/being a couch potato to a suddenly sleep-deprived #Monday?; Sibs the couch potato; ok.done with #generations I'm about too cash in all these couch potato chips; Today was a couch potato day. The best show was "Top Chefs Just Desserts". Wanna make some...; being the best couch potato ever!!! Charging fir the week ahead!! Pre Bday Function tonight inside #INDUSTRY; don't be a couch potato; PLL and GG haha I am officially a couch potato. Using the sum table TedgeDeg allows the scale of a query to be computed prior to execution of the full query.
25 Most Popular Hashtags (#) D4M database code:
Tdeg = DB('TedgeDeg');
An = Assoc('','','');
Ti = Iterator(Tdeg,'elements',10000);
A = Ti(StartsWith('word|# '),:);
while nnz(A)
  An = An + (str2num(A) > 100);
  A = Ti();
end
D4M result (25 seconds, Np = 1): word|#Egypt 14781, word|#FF 17508, word|#Jan25, word|#jan25, word|#nowplaying 13334, word|#np. Iterators allow scanning of data in optimally sized blocks.
26 Most Popular Users by Reference D4M database code:
Tdeg = DB('TedgeDeg');
An = Assoc('','','');
Ti = Iterator(Tdeg,'elements',10000);
A = Ti(StartsWith('word|@ '),:);
while nnz(A)
  An = An + (str2num(A) > 100);
  A = Ti();
end
D4M result (130 seconds, Np = 1): six @-handles with reference counts 32368, 7457, 9640, 8481, 4620, and 15581. Iterators allow scanning of data in optimally sized blocks.
27 Users Who ReTweet the Most: Problem Size D4M code to check size of status codes:
Tdeg = DB('TedgeDeg');
Tdeg(StartsWith('stat| '),:)
D4M results (0.02 seconds, Np = 1): stat|200 OK, stat|301 Moved permanently, stat|302 ReTweet, stat|403 Protected, stat|404 Deleted tweet, stat|408 Request timeout. The sum table TedgeDeg indicates 836K retweets (~5% of the total), small enough to hold all TweetIDs in memory: on the boundary between a database query and a complete file scan.
28 Users Who ReTweet the Most: Parallel Database Query D4M parallel database code:
T = DB('Tedge','TedgeT');
Ar = T(:,'stat|302 ');
my = global_ind(zeros(size(Ar,1),1,map([Np 1],{},0:Np-1)));
An = Assoc('','','');
N = 10000;
for i = my(1):N:my(end)
  Ai = dblLogi(T(Row(Ar(i:min(i+N,my(end)),:)),:));
  An = An + sum(Ai(:,StartsWith('user|,')),1);
end
An = gagg(An > 2);
D4M result (130 seconds, Np = 8): user|Puque, user|Say 113, user|carp_fans 115, user|habu_bot 111, user|kakusan_rt 135, user|umaitofu 116. Each processor queries all the retweet TweetIDs and picks a subset; processors each sum all users in their tweets and then aggregate.
29 Users Who ReTweet the Most: Parallel File Scan D4M parallel file scan code:
Nfile = size(filelist,1);
my = global_ind(zeros(Nfile,1,map([Np 1],{},0:Np-1)));
An = Assoc('','','');
for i = my
  load(filelist{i});
  An = An + sum(A(Row(A(:,'stat|302,')),StartsWith('user|,')),1);
end
An = gagg(An > 2);
D4M result (150 seconds, Np = 16): user|Puque, user|Say 113, user|carp_fans 114, user|habu_bot 109, user|kakusan_rt 135, user|umaitofu 114. Each processor picks a subset of files and scans them for retweets; processors each sum all users in their tweets and then aggregate.
30 Summary Big data is found across a wide range of areas: document analysis, computer network analysis, DNA sequencing. Currently there is a gap in big data analysis tools for algorithm developers. D4M fills this gap by providing algorithm developers composable associative arrays that admit linear algebraic manipulation.