Winter Semester 2012/2013




Winter Semester 2012/2013
Seminars
Bachelor Computer Science
- CS 3702 Databases and Query Processing
Master Computer Science: Advanced Topics of Database Systems
- CS 5840 - Interdisciplinary Competencies = English-language seminar
- CS 5480 - for the area Software Systems Engineering = Seminar Software Systems Engineering
Privatdozent Dr. rer. nat. habil. Sven Groppe, groppe@ifis.uni-luebeck.de
Dipl.-Inf. Stefan Werner, werner@ifis.uni-luebeck.de

Students' Duties
- Preparation of slides
- Preparation of a handout
  - 2 to 3 pages, to be delivered to all participants and to the supervisors directly before the presentation
- Presentation
  - Approx. 1 hour (including discussion)
- Attending the presentations of all other students
- Contributing to a lively discussion after each presentation

Timeline
- Assignment of topics
- Discussion with supervisor about the presentation
- Electronic submission of PDFs of presentation and handout via email (final from the student's view)
- Improving presentation and handout according to the supervisor's comments and remarks
- Presentation; electronic submission of source files and PDFs of slides and handout to the supervisor via email
- Participating in all other presentations and contributing to lively discussions
- Interval marked on the timeline: 2 weeks

Topics Bachelor/Master Seminars
- Green topics -> seminar for bachelor students
- Blue topics -> seminar for master students
- Topics may be exchanged (to be discussed with the supervisors)
  - Bachelor students choosing blue topics
  - Master students choosing green topics

Overview Topics
- FPGA
- Cloud Computing
- Semantic Web
- B+-tree
- Compression
- Regular Expressions
- Relational Databases
- Recursive Queries

FPGA (Field Programmable Gate Arrays)
- Integrated circuit (IC)
- Configuration after manufacturing
- Hardware description language (HDL)
- Complex functional blocks, arranged in a periodic structure
- Interconnection network

FPGA (Field Programmable Gate Arrays)
- Advantages over general-purpose CPUs
  - Inherent parallelism
  - High throughput at low clock rates
  - High energy efficiency
- Reconfigurable computing = offloading expensive tasks from software to the FPGA

FPGA (Field Programmable Gate Arrays) - Topics
- A Memory-Balanced Linear Pipeline Architecture for Trie-based IP Lookup (HOTI 2007)
- High Throughput and Large Capacity Pipelined Dynamic Search Tree on FPGA (FPGA 2010)
- Massively Parallel XML Twig Filtering Using Dynamic Programming on FPGAs (ICDE 2010)
- How Soccer Players Would Do Stream Joins (SIGMOD 2011)
- Evaluating FPGA-acceleration for Real-time Unstructured Search (ISPASS 2012)
- Skeleton Automata for FPGAs: Reconfiguring without Reconstruction (SIGMOD 2012)

Cloud Computing
- Typically, a cloud provider offers online services
- Multiple server-based computational resources are accessed via a digital network such as the Internet
- Applications are provided and managed by the cloud server
- Data is stored remotely in the cloud configuration
source: Wikipedia

Cloud Computing - Topics
- Walnut: A Unified Cloud Object Store (SIGMOD 2012)
  - Developed at Yahoo! with the goal of serving as a common low-level storage layer for a variety of cloud data management systems (e.g. Hadoop, MObStor, PNUTS)
  - => Enables sharing of hardware resources between different types of clouds
- CloudRAMSort: Fast and Efficient Large-Scale Distributed RAM Sort on Shared-Nothing Cluster (SIGMOD 2012)
  - Sort algorithm that sorts 1 terabyte of data in 4.6 seconds
- PIQL: Success-Tolerant Query Processing in the Cloud (VLDB 2012)
  - Newly released web applications often suffer a "Success Disaster", where overloaded database systems destroy a previously good user experience, i.e. too much success causes performance problems
  - The authors propose PIQL to provide scale independence for queries, so that applications become success-tolerant

Semantic Web
- Idea
  - "Web of data" that enables machines to understand the semantics, or meaning, of information on the World Wide Web
  - Extends the network of hyperlinked human-readable web pages by inserting machine-readable metadata about pages and how they are related to each other
- Semantic Web databases can be seen as graph databases for labelled and directed graphs
- Additional implicit data is considered based on ontology inference
  - E.g. Peter has the driverlicenseno 456 (which is a fact) and only a person can have a driverlicenseno (expressed in an ontology); then Peter is a person (implicitly present)
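The inference example with Peter can be sketched in a few lines of Python. This is only an illustration, not how any particular Semantic Web database works: the triple layout, the `domains` mapping and the `infer_types` helper are assumptions made for this sketch.

```python
# Hypothetical mini triple store: (subject, predicate, object).
facts = {("Peter", "driverlicenseno", "456")}

# Hypothetical ontology: a property maps to the class that every
# subject of that property must belong to.
domains = {"driverlicenseno": "Person"}


def infer_types(facts, domains):
    """Return the implicit (subject, 'type', class) triples."""
    return {(s, "type", domains[p]) for (s, p, o) in facts if p in domains}


implicit = infer_types(facts, domains)
```

Here `implicit` contains the single triple stating that Peter is a Person, which was never stored explicitly.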

Semantic Web - Topics
- DBpedia SPARQL Benchmark: Performance Assessment with Real Queries on Real Data (ISWC 2011)
  - DBpedia contains data extracted from Wikipedia
  - Benchmark queries are determined from frequently used queries for a SPARQL querying web service (endpoint)
- ANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints (ISWC 2011)
  - ANAPSID provides physical operators that detect when a source becomes blocked or data traffic is bursty, and produce results as quickly as data arrives from the sources
- Castor: A Constraint-Based SPARQL Engine with Active Filter Processing (ESWC 2011)
  - Constraint Programming was originally designed for solving NP-hard problems, but the authors adapt it to SPARQL processing by exploiting filters early on instead of in a post-processing step

Semantic Web - Topics
- View Selection in Semantic Web Databases (VLDB 2012)
  - Use existing views (= materialized results of previous queries) to answer a new query
  - Simple ontology inference is considered
- The Complexity of Evaluating Path Expressions in SPARQL (PODS 2012)
  - SPARQL 1.1 allows querying nodes at arbitrary depth. This paper studies the complexity of W3C-conformant evaluation and presents an alternative.
- Static Analysis and Optimization of Semantic Web Queries (PODS 2012)
  - Study of containment, equivalence and evaluation of SPARQL queries

B+-tree
- Variant of the B-tree
- Self-balancing, block-oriented search tree for large datasets
[Figure: example B+-tree with a Keys level above a Values level holding 1 to 16]
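A lookup in such a tree can be sketched as follows. This is a minimal sketch under assumptions: the `Inner` and `Leaf` classes, the tiny example tree and the convention that keys equal to a separator go to the right child are all chosen for illustration. Insertion and rebalancing, which make the tree self-balancing, are left out.

```python
from bisect import bisect_right


class Leaf:
    """Leaf node: sorted keys with their values."""

    def __init__(self, keys, values):
        self.keys, self.values = keys, values


class Inner:
    """Inner node: separator keys and len(keys) + 1 children."""

    def __init__(self, keys, children):
        self.keys, self.children = keys, children


def search(node, key):
    """Descend from the root to a leaf and return the value or None."""
    while isinstance(node, Inner):
        # Keys smaller than a separator go left, all others go right.
        node = node.children[bisect_right(node.keys, key)]
    if key in node.keys:
        return node.values[node.keys.index(key)]
    return None


# Hypothetical two-level tree with one separator key.
tree = Inner([5], [Leaf([1, 3], ["a", "b"]), Leaf([5, 7], ["c", "d"])])
```

With this tree, `search(tree, 5)` follows the right child and finds the value stored next to key 5, while `search(tree, 4)` ends in the left leaf and returns `None`.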

B+-tree - Topics
- B+-tree Index Optimization by Exploiting Internal Parallelism of Flash-based Solid State Drives (VLDB 2011)
  - B+-tree variant for SSDs using parallel I/O

Compression
Window-based dictionaries
Example: LZ77 for elimination of repeated sequences (simplified)
Each compressed triple is (position in dictionary, length of repetition, next character). The sliding window consists of the dictionary (positions 0 to 7) followed by the lookahead.

Compression of "ANANAS":
Dictionary   Lookahead   Compressed data
(empty)      ANAN        (0,0,A)
A            NANA        (0,0,N)
AN           ANAS        (6,2,A)
ANANA        S           (0,0,S)
ANANAS

Decompression:
Compressed data   Action                                           Dictionary
(0,0,A)           no compression => "A"                            A
(0,0,N)           no compression => "N"                            AN
(6,2,A)           copy 2 characters from Pos. 6 and append "A"     ANANA
(0,0,S)           no compression => "S"                            ANANAS
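The simplified LZ77 scheme above can be sketched in Python. The sketch makes assumptions that the slide does not state outright: the dictionary is right-aligned in an 8-position window (which is why "AN" starts at position 6), a match may not run past the end of the dictionary, and the first longest match wins. The function names `compress` and `decompress` are illustrative.

```python
WINDOW = 8  # assumed dictionary size, positions 0 to 7


def compress(data, window=WINDOW):
    """Encode data as (position, length, next character) triples."""
    out, i = [], 0
    while i < len(data):
        dictionary = data[max(0, i - window):i]
        shift = window - len(dictionary)  # right-align in the window
        best_pos, best_len = 0, 0
        for start in range(len(dictionary)):
            length = 0
            # Keep one character in reserve for the "next character" field.
            while (i + length < len(data) - 1
                   and start + length < len(dictionary)
                   and dictionary[start + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start + shift, length
        out.append((best_pos, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def decompress(triples, window=WINDOW):
    """Rebuild the text from (position, length, next character) triples."""
    text = ""
    for pos, length, char in triples:
        dictionary = text[-window:]
        if length:
            start = pos - (window - len(dictionary))
            text += dictionary[start:start + length]
        text += char
    return text


triples = compress("ANANAS")
```

For the input "ANANAS" this produces the four triples from the table, and `decompress(triples)` restores the original string.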

Compression - Topics
- Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections (VLDB 2011)
  - Instead of window-based dictionaries, a dictionary is generated by sampling the whole document

Regular Expressions
- Regular expressions contain operators for concatenation, union (choice) and Kleene star
- Used to define searches/matchings in
  - file systems
  - databases
  - programming languages (e.g. tokens while compiling, and API support)
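The three operators can be tried out with Python's standard `re` module. The pattern below is an arbitrary example chosen for this sketch: `ab` is a concatenation, `ab|c` a union, and `(...)*` a Kleene star, followed by a final `d`.

```python
import re

# (ab|c)* : zero or more repetitions of either "ab" or "c", then "d".
pattern = re.compile(r"(ab|c)*d")


def matches(word):
    """True if the whole word belongs to the language of the pattern."""
    return pattern.fullmatch(word) is not None
```

For instance, `matches("ababcd")` and `matches("d")` hold, while `matches("acd")` does not, because a lone "a" is not covered by either branch of the union.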

Regular Expressions - Topics
- Deterministic Regular Expressions in Linear Time (PODS 2012)
  - Algorithms to test determinism and matching

Relational Database
- Most widely used type of database
- Data model: relation (table)

Students
Name       ID   Address
student1   1    HL
student2   2    HH

Lectures
ID   Lecture-ID
1    DB1
2    MVDB2

- Query language: SQL
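The two relations and a simple SQL query can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the column `Lecture-ID` is renamed to `LectureID` to avoid quoting, and the `ID` column of Lectures is assumed to refer to the student ID, which the slide does not state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (Name TEXT, ID INTEGER PRIMARY KEY, Address TEXT);
CREATE TABLE Lectures (ID INTEGER, LectureID TEXT);
INSERT INTO Students VALUES ('student1', 1, 'HL'), ('student2', 2, 'HH');
INSERT INTO Lectures VALUES (1, 'DB1'), (2, 'MVDB2');
""")

# Join both relations on the (assumed) student ID.
rows = conn.execute(
    "SELECT s.Name, l.LectureID "
    "FROM Students s JOIN Lectures l ON s.ID = l.ID "
    "ORDER BY s.ID"
).fetchall()
```

`rows` then pairs each student name with the lecture entry that carries the same ID.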

Relational Database - Topics
- NoDB: Efficient Query Execution on Raw Data Files (SIGMOD 2012)
  - NoDB avoids data loading while still maintaining the whole feature set of modern databases, i.e., NoDB works on raw data files
- SharedDB: Killing One Thousand Queries With One Stone (VLDB 2012)
  - Deals with multi-query optimization, i.e., query execution plans are generated for many queries that are to be executed at the same time
  - High throughput for high workloads

Recursive Queries
- Datalog, a subset of Prolog, is a rule language for deductive databases
- Example:
  Facts:
    parent(bill,mary).
    parent(mary,john).
  Rules:
    ancestor(X,Y) :- parent(X,Y).
    ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).
  Query:
    ?- ancestor(bill,X).
- Topic: More Efficient Datalog Queries: Subsumptive Tabling Beats Magic Sets (SIGMOD 2011)
  - Top-down evaluation method with more reuse of answers than the dominant tabling strategy
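To see what the ancestor rules compute, here is a minimal sketch of a naive bottom-up evaluation in Python. It is not the subsumptive tabling method of the topic paper, which works top-down; the function name `ancestors` and the set-of-pairs representation are assumptions for this illustration.

```python
# Facts from the example: parent(bill,mary). parent(mary,john).
parent = {("bill", "mary"), ("mary", "john")}


def ancestors(parent):
    """Apply both ancestor rules until no new pair can be derived."""
    ancestor = set(parent)  # ancestor(X,Y) :- parent(X,Y).
    while True:
        # ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).
        new = {(x, y)
               for (x, z) in parent
               for (z2, y) in ancestor
               if z == z2} - ancestor
        if not new:
            return ancestor
        ancestor |= new


# Query: ?- ancestor(bill,X).
answers = {y for (x, y) in ancestors(parent) if x == "bill"}
```

The query binds X to mary (directly via the first rule) and to john (via the recursive rule).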

Recursive Queries
- Recursive query constructs are part of ANSI SQL:1999
- Topic: Adaptive Optimization of Recursive Queries in Teradata (SIGMOD 2012)
  - The query execution plan is optimized within the iterations of recursive queries
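A recursive SQL query can be tried out with a recursive common table expression. The sketch below uses SQLite through Python's `sqlite3` module and reuses the parent facts from the Datalog example. The table and column names are assumptions, and it shows plain SQL recursion only, not Teradata's adaptive optimization.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (p TEXT, c TEXT);
INSERT INTO parent VALUES ('bill', 'mary'), ('mary', 'john');
""")

# ancestor(a, d): a is an ancestor of d.
query = """
WITH RECURSIVE ancestor(a, d) AS (
    SELECT p, c FROM parent
    UNION
    SELECT parent.p, ancestor.d
    FROM parent JOIN ancestor ON parent.c = ancestor.a
)
SELECT d FROM ancestor WHERE a = 'bill' ORDER BY d
"""
descendants = [row[0] for row in conn.execute(query)]
```

The first SELECT seeds the result with the direct parent pairs, and the second SELECT is re-evaluated until no new rows appear, so `descendants` lists everyone reachable from bill.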

Winter Semester 2012/2013: Seminars CS 3702, CS 5840, CS 5480
Bachelor / Master
FPGA
- A Memory-Balanced Linear Pipeline Architecture for Trie-based IP Lookup (HOTI 2007)
- High Throughput and Large Capacity Pipelined Dynamic Search Tree on FPGA (FPGA 2010)
- Massively Parallel XML Twig Filtering Using Dynamic Programming on FPGAs (ICDE 2010)
- How Soccer Players Would Do Stream Joins (SIGMOD 2011)
- Evaluating FPGA-acceleration for Real-time Unstructured Search (ISPASS 2012)
- Skeleton Automata for FPGAs: Reconfiguring without Reconstruction (SIGMOD 2012)
Cloud Computing
- Walnut: A Unified Cloud Object Store (SIGMOD 2012)
- CloudRAMSort: Fast and Efficient Large-Scale Distributed RAM Sort on Shared-Nothing Cluster (SIGMOD 2012)
- PIQL: Success-Tolerant Query Processing in the Cloud (VLDB 2012)
Semantic Web
- DBpedia SPARQL Benchmark: Performance Assessment with Real Queries on Real Data (ISWC 2011)
- ANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints (ISWC 2011)
- Castor: A Constraint-Based SPARQL Engine with Active Filter Processing (ESWC 2011)
Semantic Web
- View Selection in Semantic Web Databases (VLDB 2012)
- The Complexity of Evaluating Path Expressions in SPARQL (PODS 2012)
- Static Analysis and Optimization of Semantic Web Queries (PODS 2012)
B+-tree
- B+-tree Index Optimization by Exploiting Internal Parallelism of Flash-based Solid State Drives (VLDB 2011)
Compression
- Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections (VLDB 2011)
Regular Expressions
- Deterministic Regular Expressions in Linear Time (PODS 2012)
Relational Database
- NoDB: Efficient Query Execution on Raw Data Files (SIGMOD 2012)
- SharedDB: Killing One Thousand Queries With One Stone (VLDB 2012)
Recursive Queries
- More Efficient Datalog Queries: Subsumptive Tabling Beats Magic Sets (SIGMOD 2011)
- Adaptive Optimization of Recursive Queries in Teradata (SIGMOD 2012)
More topics upon request