Scaling out the Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany
|
|
- Isabel Ray
- 8 years ago
- Views:
Transcription
1 Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany, Thorsten Papenbrock, Felix Naumann Research Assistant Hasso Plattner Institute, Potsdam, Germany
2 Inclusion Dependencies Examples Customers ID Name Address 1 Tanja Jager Marseiller Str Sandra Möller Flughafenstr Dennis Eberhart Sonnenallee 19 4 Barbara Pabst Ziegelstr Thorsten Mauer Güntzelstr. 90 Customer Quantity ID Orders Customer Quantity 3 CK DF KE LM PE
3 Inclusion Dependencies Examples 3
4 Inclusion Dependencies Example 4
5 Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions
6 Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions
7 Related Work MIND Fabien De Marchi, Stéphan Lopes, and Jean-Marc Petit. Unary and n-ary inclusion dependency discovery in relational databases. Journal of Intelligent Information Systems, 32:53 73, ID Name Address 1 Tanja Jager Marseiller Str. 12 Customer Quantity 2 Sandra Möller Flughafenstr CK Dennis Eberhart Sonnenallee 19 3 DF Barbara Pabst Ziegelstr KE Thorsten Mauer Güntzelstr LM PE
8 Related Work MIND Value Attributes 1 ID, Customer, Quantity Tanja Jager Name Quantity ID Quantity? Quantity Quantity Marseiller Str. 12 Address 2 ID, Quantity Sandra Möller Flughafenstr. 63 Intersection Name Address ID, Quantity 8
9 Related Work SPIDER Jana Bauckmann, Ulf Leser, and Felix Naumann. Efficiently Computing Inclusion Dependencies for Schema Discovery. In ICDE Workshops, ID Name Address 1 Tanja Jager Marseiller Str. 12 Customer Quantity 2 Sandra Möller Flughafenstr CK Dennis Eberhart Sonnenallee 19 3 DF Barbara Pabst Ziegelstr KE Thorsten Mauer Güntzelstr LM PE
10 Related Work SPIDER Jana Bauckmann, Ulf Leser, and Felix Naumann. Efficiently Computing Inclusion Dependencies for Schema Discovery. In ICDE Workshops, ID Name Customer Quantity 1 Barbara Pabst 1 CK Dennis Eberhart 3 DF Sandra Möller 5 KE Tanja Jager Thorsten Mauer LM PE
11 Related Work Common proceeding Input Data Full Outer Join Inclusion Dependencies ID Name Addr ID Name Addr Cus Qty 1 T.J. M.12 2 S.M. F.63 Cus Qty 3 D.E. S.19 3 CK 1 4 B.P. Z.76 3 DF 1 5 T.M. G.90 3 KE 1 1 LM 2 5 PE 1 Step 1: T.J. S.M. Calculate full outer join of all attributes Step 2: Quantity ID Customer ID Extract inclusion dependencies 11
12 Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions
13 SINDY: A distributed discovery algorithm Distributed setting ID Name Addr 1 T.J. M.12 2 S.M. F.63 3 D.E. S.19 ID Name Addr 4 B.P. Z.76 5 T.M. G.90 Cus Qty 1 LM 2 5 PE 1 Cus Qty 3 CK 1 3 DF 1 3 KE 1 13
14 SINDY: A distributed discovery algorithm Perform full outer join ID Name Addr 1 T.J. M.12 2 S.M. F.63 1 ID 2 ID T.J. Name S.M. Name M.12 Addr F.63 Addr 3 D.E. S.19 3 ID D.E. Name S.19 Addr Cus Qty 3 CK 1 3 DF 1 3 KE 1 3 Cus 3 Cus 3 Cus CK DF KE 1 Qty 1 Qty 1 Qty ID Name Addr 4 B.P. Z.76 5 T.M. G.90 4 ID 5 ID B.P. Name T.M. Name Z.76 Addr G.90 Addr 1 Cus, Qty Cus Qty 1 LM 2 5 PE 1 1 Cus 5 Cus LM PE 2 Qty 1 Qty 5 Cus, ID 14
15 SINDY: A distributed discovery algorithm Perform full outer join ID Name Addr 1 T.J. M.12 2 S.M. F.63 1 ID 1 ID, Cus, T.J. Qty Name 2 ID S.M. Name M.12 Addr F.63 Addr 3 D.E. S.19 3 ID D.E. Name S.19 Addr Cus Qty 3 CK 1 3 DF 1 3 KE 1 3 Cus 2 ID, Qty CK DF KE 1 Qty ID Name Addr 4 B.P. Z.76 5 T.M. G.90 4 ID 3 ID, Cus B.P. Name T.M. Name Z.76 Addr G.90 Addr 1 Cus, Qty Cus Qty 1 LM 2 5 PE 1 LM PE 2 Qty 5 Cus, ID 15
16 SINDY: A distributed discovery algorithm Perform full outer join ID Name Addr 1 T.J. M.12 2 S.M. F.63 3 D.E. S.19 ID, Cus, Qty ID Name Name Addr Addr Cus Qty 3 CK 1 3 DF 1 3 KE 1 ID, Qty Cus, ID Name ID Name Addr 4 B.P. Z.76 5 T.M. G.90 Cus, ID Name Addr Addr Addr Cus Qty 1 LM 2 5 PE 1 Name 16
17 SINDY: A distributed discovery algorithm Distributed join product ID, Cus, Qty Addr Name ID ID, Cus Name Addr ID, Cus ID, Qty Name 17
18 SINDY: A distributed discovery algorithm Evaluate full outer join ID, Cus, Qty Addr Name ID ID Cus, Qty Cus ID, Qty Qty ID, Cus Addr Ø Name Ø ID Ø Ø ID, Cus ID, Qty ID Qty ID Ø ID Cus Qty ID Cus ID Name Name Ø Ø ID, Cus ID Cus Cus ID Name Name Ø Ø Addr Addr Ø 18
19 SINDY: A distributed discovery algorithm Evaluate full outer join ID, Cus, Qty Cus ID, Qty Qty ID, Cus Addr Addr Ø Name Name Ø ID ID Ø Ø ID, Cus ID, Qty ID Ø Qty ID Cus ID Name Name Ø Ø ID, Cus ID Cus Cus ID Name Name Ø Ø Addr Addr Ø 19
20 SINDY: A distributed discovery algorithm Distributed inclusion dependencies Addr Ø ID Ø Qty ID Cus ID Ø Quantity ID Customer ID Name Ø 20
21 SINDY Variants Inclusion dependencies on combinations of columns (aka n-ary INDs) Adaption: Create cells for combinations of values Powerful in combination with apriori-like proceeding Partial inclusion dependencies Adaption: aggregate IND candidates with multiset union instead of intersection Compare with number of distinct values of dependent column 21
22 Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions
23 Evaluation Experimental setup Cluster Setup 1 master node (Intel 2x2.67 GHz, 8 GiB RAM) 10 worker nodes (Intel Core 2 2x2.6 GHz, 8 GiB RAM) Apache HDFS 2.2, Apache Flink Single node (for SPIDER) Intel 8x2GHz, 128 GiB RAM, RAID-1 Datasets Relational datasets from different domains 16 KB to 44.9 GB 23
24 runtime [s] speed up Evaluation Performance comparison with SPIDER SINDY SPIDER Speed Up
25 runtime [s] Evaluation Scale-Out Behavior /1 2/2 3/3 4/4 5/5 6/6 7/7 8/8 9/9 10/10 20/10 logicalworkers/physical workers MB-core (5.8 GB) TPC-H (1.4 GB) CATH (907 MB) LOD+ (825 MB) BIOSQLSP (567 MB) WIKIPEDIA (539 MB) CATH (907 MB) CENSUS (111 MB) SCOP (15 MB) COMA (16 KB) 25
26 Scaling Out the Discovery of Inclusion Dependencies Agenda 1. Discovering Inclusion Dependencies 2. Related Work 3. SINDY: A distributed discovery algorithm 4. Evaluation 5. Conclusions
27 Scaling Out the Discovery of Inclusion Dependencies Conclusions Presented new distributed IND discovery algorithm Applicable for unary, n-ary, and partial inclusion dependencies Consists of full outer join calculation and extraction phase Scales well on large datasets Open questions How can one continuously maintain inclusion dependencies? How can the algorithm be applied to similar problems, e.g., RDF data? 27
28 Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany, Thorsten Papenbrock, Felix Naumann Research Assistant Hasso Plattner Institute, Potsdam, Germany
29 Backup Apache Hadoop Implementation Map Combine Reduce Split tuples into cells group by value Union attributes Map Combine Reduce split attributes into IND candidates group by dependent attribute intersect referenced attributes 29
30 Backup Apache Flink/Spark Implementation Map Combine Reduce Split tuples into cells group by value Union attributes Map Combine Reduce split attributes into IND candidates group by dependent attribute intersect referenced attributes 30
31 runtime [s] runtime [s] Backup Column and row scaling behavior MB PDB 1000 MB PDB share of rows # columns 31
32 #INDs runtime [ms] Backup Evaluation: Wikitables 7,000,000,000 6,000,000,000 5,000,000,000 4,000,000,000 3,000,000,000 2,000,000,000 1,000,000,000 0 #INDs runtime [ms] 2,000,000 1,800,000 1,600,000 1,400,000 1,200,000 1,000, , , , , ,000 1,000,000 1,500,000 #columns 32
33 Runtime [s] Backup Evaluation: n-ary INDs n=6 n=5 n=4 40 n=3 n=2 30 n= SCOP COMA CENSUS CATH BIOSQLSP TPC-H remainder 33
Efficiently Identifying Inclusion Dependencies in RDBMS
Efficiently Identifying Inclusion Dependencies in RDBMS Jana Bauckmann Department for Computer Science, Humboldt-Universität zu Berlin Rudower Chaussee 25, 12489 Berlin, Germany bauckmann@informatik.hu-berlin.de
More informationParallels Plesk Automation
Parallels Plesk Automation Contents Compact Configuration: Linux Shared Hosting 3 Compact Configuration: Mixed Linux and Windows Shared Hosting 4 Medium Size Configuration: Mixed Linux and Windows Shared
More informationApplication Tool for Experiments on SQL Server 2005 Transactions
Proceedings of the 5th WSEAS Int. Conf. on DATA NETWORKS, COMMUNICATIONS & COMPUTERS, Bucharest, Romania, October 16-17, 2006 30 Application Tool for Experiments on SQL Server 2005 Transactions ŞERBAN
More informationVectorwise 3.0 Fast Answers from Hadoop. Technical white paper
Vectorwise 3.0 Fast Answers from Hadoop Technical white paper 1 Contents Executive Overview 2 Introduction 2 Analyzing Big Data 3 Vectorwise and Hadoop Environments 4 Vectorwise Hadoop Connector 4 Performance
More informationAdvanced SQL Query To Flink Translator
Advanced SQL Query To Flink Translator Yasien Ghallab Gouda Full Professor Mathematics and Computer Science Department Aswan University, Aswan, Egypt Hager Saleh Mohammed Researcher Computer Science Department
More informationIgnify ecommerce. Item Requirements Notes
wwwignifycom Tel (888) IGNIFY5 sales@ignifycom Fax (408) 516-9006 Ignify ecommerce Server Configuration 1 Hardware Requirement (Minimum configuration) Item Requirements Notes Operating System Processor
More informationUse of Hadoop File System for Nuclear Physics Analyses in STAR
1 Use of Hadoop File System for Nuclear Physics Analyses in STAR EVAN SANGALINE UC DAVIS Motivations 2 Data storage a key component of analysis requirements Transmission and storage across diverse resources
More informationA Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
More informationLDIF - Linked Data Integration Framework
LDIF - Linked Data Integration Framework Andreas Schultz 1, Andrea Matteini 2, Robert Isele 1, Christian Bizer 1, and Christian Becker 2 1. Web-based Systems Group, Freie Universität Berlin, Germany a.schultz@fu-berlin.de,
More informationBIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
More informationBig Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel
Big Data and Analytics: A Conceptual Overview Mike Park Erik Hoel In this technical workshop This presentation is for anyone that uses ArcGIS and is interested in analyzing large amounts of data We will
More informationCIS 631 Database Management Systems Sample Final Exam
CIS 631 Database Management Systems Sample Final Exam 1. (25 points) Match the items from the left column with those in the right and place the letters in the empty slots. k 1. Single-level index files
More informationBeyond Sampling: Fast, Whole- Dataset Analytics for Big Data on Hadoop
Josh Poduska, Sr. Business Analytics Consultant, Actian Corporation Beyond Sampling: Fast, Whole- Dataset Analytics for Big Data on Hadoop October 2013 KNIME Day Boston Agenda The Age of Data Scaling Gap
More informationPepper: An Elastic Web Server Farm for Cloud based on Hadoop. Subramaniam Krishnan, Jean Christophe Counio Yahoo! Inc. MAPRED 1 st December 2010
Pepper: An Elastic Web Server Farm for Cloud based on Hadoop Subramaniam Krishnan, Jean Christophe Counio. MAPRED 1 st December 2010 Agenda Motivation Design Features Applications Evaluation Conclusion
More informationUnique column combinations
Unique column combinations Arvid Heise Guest lecture in Data Profiling and Data Cleansing Prof. Dr. Felix Naumann Agenda 2 Introduction and problem statement Unique column combinations Exponential search
More informationMaximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
More informationVery Large Enterprise Network, Deployment, 25000+ Users
Very Large Enterprise Network, Deployment, 25000+ Users Websense software can be deployed in different configurations, depending on the size and characteristics of the network, and the organization s filtering
More informationHadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010
Hadoop s Entry into the Traditional Analytical DBMS Market Daniel Abadi Yale University August 3 rd, 2010 Data, Data, Everywhere Data explosion Web 2.0 more user data More devices that sense data More
More informationGraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
More informationPerformance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications
Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar 2 Hadoop Introduc-on Open source MapReduce
More informationCovering or Complete? Discovering Conditional Inclusion Dependencies
Covering or Complete? Discovering Conditional Inclusion Dependencies Jana Bauckmann, Ziawasch Abedjan, Ulf Leser, Heiko Müller, Felix Naumann Technische Berichte Nr. 62 des Hasso-Plattner-Instituts für
More informationA Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
More informationHow To Create A Large Data Storage System
UT DALLAS Erik Jonsson School of Engineering & Computer Science Secure Data Storage and Retrieval in the Cloud Agenda Motivating Example Current work in related areas Our approach Contributions of this
More informationBenchmarking Cassandra on Violin
Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract
More informationFlexible Reliable Affordable Web-scale computing.
bigdata Flexible Reliable Affordable Web-scale computing. bigdata 1 OSCON 2008 Background How bigdata relates to other efforts. Architecture Some examples RDF DB Some examples Web 3.0 Processing Using
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationVery Large Enterprise Network Deployment, 25,000+ Users
Very Large Enterprise Network Deployment, 25,000+ Users Websense software can be deployed in different configurations, depending on the size and characteristics of the network, and the organization s filtering
More informationIncidentMonitor Server Specification Datasheet
IncidentMonitor Server Specification Datasheet Prepared by Monitor 24-7 Inc October 1, 2015 Contact details: sales@monitor24-7.com North America: +1 416 410.2716 / +1 866 364.2757 Europe: +31 088 008.4600
More informationBookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011
BookKeeper Flavio Junqueira Yahoo! Research, Barcelona Hadoop in China 2011 What s BookKeeper? Shared storage for writing fast sequences of byte arrays Data is replicated Writes are striped Many processes
More informationHadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
More informationOverview Motivation MapReduce/Hadoop in a nutshell Experimental cluster hardware example Application areas at the Austrian National Library
Overview Motivation MapReduce/Hadoop in a nutshell Experimental cluster hardware example Application areas at the Austrian National Library Web Archiving Austrian Books Online SCAPE at the Austrian National
More informationOracle Platform as a Service and Infrastructure as a Service Public Cloud Service Descriptions-Metered & Non-Metered.
Oracle Platform as a Service and Infrastructure as a Service Public Cloud Service Descriptions-Metered & Non-Metered August 24, 2015 Contents GLOSSARY PUBLIC CLOUD SERVICES-NON-METERED... 4 ORACLE PLATFORM
More informationOverview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses
Slope One Recommender on Hadoop YONG ZHENG Center for Web Intelligence DePaul University Nov 15, 2012 Overview Introduction Recommender Systems & Slope One Recommender Distributed Slope One on Mahout and
More informationGraphalytics: A Big Data Benchmark for Graph-Processing Platforms
Graphalytics: A Big Data Benchmark for Graph-Processing Platforms Mihai Capotă, Tim Hegeman, Alexandru Iosup, Arnau Prat-Pérez, Orri Erling, Peter Boncz Delft University of Technology Universitat Politècnica
More informationPerformance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File
More informationData Management in SAP Environments
Data Management in SAP Environments the Big Data Impact Berlin, June 2012 Dr. Wolfgang Martin Analyst, ibond Partner und Ventana Research Advisor Data Management in SAP Environments Big Data What it is
More informationAdonis Technical Requirements
Information Sheet Adonis Technical Requirements CONTENTS Contents... 1 Adonis Project Implementation... 1 Host Installation / Onboard Installation Full replication (LARGER Vessels):... 1 Onboard installation
More informationSOLUTION BRIEF: SLCM R12.8 PERFORMANCE TEST RESULTS JANUARY, 2013. Submit and Approval Phase Results
SOLUTION BRIEF: SLCM R12.8 PERFORMANCE TEST RESULTS JANUARY, 2013 Submit and Approval Phase Results Table of Contents Executive Summary 3 Test Environment 4 Server Topology 4 CA Service Catalog Settings
More informationPerformance test report
Disclaimer This report was proceeded by Netventic Technologies staff with intention to provide customers with information on what performance they can expect from Netventic Learnis LMS. We put maximum
More informationA stream computing approach towards scalable NLP
A stream computing approach towards scalable NLP Xabier Artola, Zuhaitz Beloki, Aitor Soroa IXA group. University of the Basque Country. LREC, Reykjavík 2014 Table of contents 1
More informationUnderstanding Hadoop Performance on Lustre
Understanding Hadoop Performance on Lustre Stephen Skory, PhD Seagate Technology Collaborators Kelsie Betsch, Daniel Kaslovsky, Daniel Lingenfelter, Dimitar Vlassarev, and Zhenzhen Yan LUG Conference 15
More informationData Domain Profiling and Data Masking for Hadoop
Data Domain Profiling and Data Masking for Hadoop 1993-2015 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or
More informationPaolo Costa costa@imperial.ac.uk
joint work with Ant Rowstron, Austin Donnelly, and Greg O Shea (MSR Cambridge) Hussam Abu-Libdeh, Simon Schubert (Interns) Paolo Costa costa@imperial.ac.uk Paolo Costa CamCube - Rethinking the Data Center
More informationProductivity frameworks in big data image processing computations - creating photographic mosaics with Hadoop and Scalding
Procedia Computer Science Volume 29, 2014, Pages 2306 2314 ICCS 2014. 14th International Conference on Computational Science Productivity frameworks in big data image processing computations - creating
More informationBIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an
More informationOracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
More informationBenchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
More informationMapReduce Job Processing
April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File
More informationMapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially
More informationSimilarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
More informationBig Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel
Big Data and Analytics: Getting Started with ArcGIS Mike Park Erik Hoel Agenda Overview of big data Distributed computation User experience Data management Big data What is it? Big Data is a loosely defined
More informationApache Flink. Fast and Reliable Large-Scale Data Processing
Apache Flink Fast and Reliable Large-Scale Data Processing Fabian Hueske @fhueske 1 What is Apache Flink? Distributed Data Flow Processing System Focused on large-scale data analytics Real-time stream
More informationIntegrating Apache Spark with an Enterprise Data Warehouse
Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationAhsay Online Backup Suite v5.0. Whitepaper Backup speed
Ahsay Online Backup Suite v5.0 Version 5.0.1.0 Jan 2006 Table of Content 1 Introduction...3 2 Testing Configuration and Setup...4 2.1 Hardware and Software Setup...4 2.2 Test Scenarios...4 3 Results...5
More informationBusiness Intelligence Extensions for SPARQL
Business Intelligence Extensions for SPARQL Orri Erling (Program Manager, OpenLink Virtuoso) and Ivan Mikhailov (Lead Developer, OpenLink Virtuoso). OpenLink Software, 10 Burlington Mall Road Suite 265
More informationBenchmarking the Availability and Fault Tolerance of Cassandra
Benchmarking the Availability and Fault Tolerance of Cassandra Marten Rosselli, Raik Niemann, Todor Ivanov, Karsten Tolle, Roberto V. Zicari Goethe-University Frankfurt, Germany Frankfurt Big Data Lab
More informationSCALABLE GRAPH ANALYTICS WITH GRADOOP AND BIIIG
SCALABLE GRAPH ANALYTICS WITH GRADOOP AND BIIIG MARTIN JUNGHANNS, ANDRE PETERMANN, ERHARD RAHM www.scads.de RESEARCH ON GRAPH ANALYTICS Graph Analytics on Hadoop (Gradoop) Distributed graph data management
More informationParquet. Columnar storage for the people
Parquet Columnar storage for the people Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter Nong Li nong@cloudera.com Software engineer, Cloudera Impala Outline Context from various
More informationefiletexas.gov Infrastructure Guidelines
efiletexas.gov Infrastructure Guidelines POWERED BY TYLER TECHNOLOGIES Contents Overview... 3 Internet Bandwidth Guidelines... 4 PC Hardware and Web Browser Guidelines... 5 Storage Guidelines... 6 Odyssey
More informationLegal Notices... 2. Introduction... 3
HP Asset Manager Asset Manager 5.10 Sizing Guide Using the Oracle Database Server, or IBM DB2 Database Server, or Microsoft SQL Server Legal Notices... 2 Introduction... 3 Asset Manager Architecture...
More informationThe Trials and Tribulations and ultimate success of parallelisation using Hadoop within the SCAPE project
The Trials and Tribulations and ultimate success of parallelisation using Hadoop within the SCAPE project Alastair Duncan STFC Pre Coffee talk STFC July 2014 SCAPE Scalable Preservation Environments The
More informationAccelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software
WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications
More informationFederated Query Processing over Linked Data
An Evaluation of Approaches to Federated Query Processing over Linked Data Peter Haase, Tobias Mathäß, Michael Ziller fluid Operations AG, Walldorf, Germany i-semantics, Graz, Austria September 1, 2010
More informationMapReduce for Data Warehouses
MapReduce for Data Warehouses Data Warehouses: Hadoop and Relational Databases In an enterprise setting, a data warehouse serves as a vast repository of data, holding everything from sales transactions
More informationCertified Big Data and Apache Hadoop Developer VS-1221
Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification
More informationBIG DATA: A CASE STUDY ON DATA FROM THE BRAZILIAN MINISTRY OF PLANNING, BUDGETING AND MANAGEMENT
BIG DATA: A CASE STUDY ON DATA FROM THE BRAZILIAN MINISTRY OF PLANNING, BUDGETING AND MANAGEMENT Ruben C. Huacarpuma, Daniel da C. Rodrigues, Antonio M. Rubio Serrano, João Paulo C. Lustosa da Costa, Rafael
More informationPricing Guide. Overview FD Enterprise License SaaS Packages Dedicated SaaS Shared SaaS. Page 2 Page 3 Page 4 Page 5 Page 8
Pricing Guide Overview FD Enterprise License SaaS Packages Dedicated SaaS Shared SaaS Page 2 Page 3 Page 4 Page 5 Page 8 Overview Dexero FD is an innovative information management solution that offers
More informationA. Aiken & K. Olukotun PA3
Programming Assignment #3 Hadoop N-Gram Due Tue, Feb 18, 11:59PM In this programming assignment you will use Hadoop s implementation of MapReduce to search Wikipedia. This is not a course in search, so
More informationBig Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing
Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:
More informationArchitecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,
More informationMapReduce and Hadoop Distributed File System V I J A Y R A O
MapReduce and Hadoop Distributed File System 1 V I J A Y R A O The Context: Big-data Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009) Google collects 270PB data in a month (2007), 20000PB
More informationPromise Pegasus R6 Thunderbolt RAID Performance
Introduction Promise Pegasus R6 Thunderbolt RAID Performance This document summarises the results of some performance testing I carried out on a Pegasus R6 Thunderbolt RAID drive from Promise Technologies
More informationComparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in
More informationCost-Effective Business Intelligence with Red Hat and Open Source
Cost-Effective Business Intelligence with Red Hat and Open Source Sherman Wood Director, Business Intelligence, Jaspersoft September 3, 2009 1 Agenda Introductions Quick survey What is BI?: reporting,
More informationXpresstransfer Online Backup Suite v5.0 Whitepaper Backup speed analysis
Xpresstransfer Online Backup Suite v5.0 Whitepaper Backup speed analysis Version 5.0 Table of Content 1 Introduction...... 3 2 Testing Configuration and Setup 3 2.1 Hardware and Software Setup... 3 2.2
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationBusiness Intelligence and Column-Oriented Databases
Page 12 of 344 Business Intelligence and Column-Oriented Databases Kornelije Rabuzin Faculty of Organization and Informatics University of Zagreb Pavlinska 2, 42000 kornelije.rabuzin@foi.hr Nikola Modrušan
More informationBenchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment. Sanjay Kulhari, Jian Wen UC Riverside
Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment Sanjay Kulhari, Jian Wen UC Riverside Team Sanjay Kulhari M.S. student, CS U C Riverside Jian Wen Ph.D. student, CS U
More informationEvaluation of Dell PowerEdge VRTX Shared PERC8 in Failover Scenario
Evaluation of Dell PowerEdge VRTX Shared PERC8 in Failover Scenario Evaluation report prepared under contract with Dell Introduction Dell introduced its PowerEdge VRTX integrated IT solution for remote-office
More informationAUTOMATIC DATABASE CONSTRUCTION FROM NATURAL LANGUAGE REQUIREMENTS SPECIFICATION TEXT
AUTOMATIC DATABASE CONSTRUCTION FROM NATURAL LANGUAGE REQUIREMENTS SPECIFICATION TEXT Geetha S. 1 and Anandha Mala G. S. 2 1 JNTU Hyderabad, Telangana, India 2 St. Joseph s College of Engineering, Chennai,
More informationSynergis Software 18 South 5 TH Street, Suite 100 Quakertown, PA 18951 +1 215.302.3000, 800.836.5440 www.synergissoftware.com version 20150330
Synergis Software 18 South 5 TH Street, Suite 100 Quakertown, PA 18951 +1 215.302.3000, 800.836.5440 www.synergissoftware.com version 20150330 CONTENTS Contents... 2 Overview... 2 Adept Server... 3 Adept
More informationHadoop on a Low-Budget General Purpose HPC Cluster in Academia
Hadoop on a Low-Budget General Purpose HPC Cluster in Academia Paolo Garza, Paolo Margara, Nicolò Nepote, Luigi Grimaudo, and Elio Piccolo Dipartimento di Automatica e Informatica, Politecnico di Torino,
More informationProposal of Dynamic Load Balancing Algorithm in Grid System
www.ijcsi.org 186 Proposal of Dynamic Load Balancing Algorithm in Grid System Sherihan Abu Elenin Faculty of Computers and Information Mansoura University, Egypt Abstract This paper proposed dynamic load
More informationData Mining with Hadoop at TACC
Data Mining with Hadoop at TACC Weijia Xu Data Mining & Statistics Data Mining & Statistics Group Main activities Research and Development Developing new data mining and analysis solutions for practical
More informationTesting 3Vs (Volume, Variety and Velocity) of Big Data
Testing 3Vs (Volume, Variety and Velocity) of Big Data 1 A lot happens in the Digital World in 60 seconds 2 What is Big Data Big Data refers to data sets whose size is beyond the ability of commonly used
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationError Log Processing for Accurate Failure Prediction. Humboldt-Universität zu Berlin
Error Log Processing for Accurate Failure Prediction Felix Salfner ICSI Berkeley Steffen Tschirpke Humboldt-Universität zu Berlin Introduction Context of work: Error-based online failure prediction: error
More informationCA Nimsoft Monitor. Probe Guide for iseries System Statistics Monitoring. sysstat v1.1 series
CA Nimsoft Monitor Probe Guide for iseries System Statistics Monitoring sysstat v1.1 series Legal Notices This online help system (the "System") is for your informational purposes only and is subject to
More informationComparison of Different Implementation of Inverted Indexes in Hadoop
Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,
More informationHadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering
HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering Chang Liu 1 Jun Qu 1 Guilin Qi 2 Haofen Wang 1 Yong Yu 1 1 Shanghai Jiaotong University, China {liuchang,qujun51319, whfcarter,yyu}@apex.sjtu.edu.cn
More informationIBM Text Analytics on Apache Spark
IBM Text Analytics on Apache Spark Dimple Bhatia (dimple@us.ibm.com, @dimpbhatia) Sudarshan Thitte (srthitte@us.ibm.com, @trsudarshan) Engineering, Text Analytics, IBM @ Spark Summit 2014 2014 2014 IBM
More informationDEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER
DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER ANDREAS-LAZAROS GEORGIADIS, SOTIRIOS XYDIS, DIMITRIOS SOUDRIS MICROPROCESSOR AND MICROSYSTEMS LABORATORY ELECTRICAL AND
More informationLinked Open Data Infrastructure for Public Sector Information: Example from Serbia
Proceedings of the I-SEMANTICS 2012 Posters & Demonstrations Track, pp. 26-30, 2012. Copyright 2012 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes.
More informationCS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment
CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract Mapreduce has been a very successful computational technique that has
More informationBig Data for Investment Research Management
IDT Partners www.idtpartners.com Big Data for Investment Research Management Discover how IDT Partners helps Financial Services, Market Research, and Investment Management firms turn big data into actionable
More informationRedundant Data Removal Technique for Efficient Big Data Search Processing
Redundant Data Removal Technique for Efficient Big Data Search Processing Seungwoo Jeon 1, Bonghee Hong 1, Joonho Kwon 2, Yoon-sik Kwak 3 and Seok-il Song 3 1 Dept. of Computer Engineering, Pusan National
More informationSNOW LICENSE MANAGER (7.X)... 3
SYSTEM REQUIREMENTS Products Snow License Manager Snow Automation Platform Snow Device Manager Snow Inventory Server, IDR, IDP Mobile Information Server Client for Windows Client for Linux Client for Unix
More informationBig Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
More information