The Database Systems and Information Management Group at Technische Universität Berlin



1 Introduction

The Database Systems and Information Management Group, known in German by the acronym DIMA, is part of the Department of Software Engineering and Theoretical Computer Science at TU Berlin. It is led by Prof. Dr. Volker Markl and consists of 3 postdocs, 8 research associates, and 19 student assistants.

2 Research Areas

The DIMA group, under the direction of Volker Markl, conducts research in the areas of information modeling, business intelligence, query processing, query optimization, the impact of new hardware architectures on information management, and applications. While having a strong focus on system building and on validating research in practical scenarios and use cases, the group aims to explore and provide fundamental, theoretically sound solutions to current major research challenges. The group interacts closely with researchers at prestigious national and international academic institutions and carries out joint research projects with leading IT companies, including Hewlett-Packard, IBM, and SAP, as well as with innovative small and medium enterprises. In the following paragraphs, we present our main research projects.

2.1 Stratosphere

Our flagship project is a Collaborative Research Unit funded by the Deutsche Forschungsgemeinschaft (DFG) in which the Technische Universität Berlin, the Humboldt-Universität zu Berlin, and the Hasso-Plattner-Institut in Potsdam jointly research information management on the cloud. Stratosphere aims to considerably advance the state of the art in data processing on parallel, adaptive architectures. Stratosphere (named after the layer of the atmosphere above the clouds) explores the power of massively parallel computing for complex information management applications. Building on the expertise of the participating researchers, we aim to develop a novel, database-inspired approach to analyze, aggregate, and query very large collections of textual or (semi-)structured data on a virtualized, massively parallel cluster architecture. Stratosphere conducts research in the areas of massively parallel data processing engines, a programming model for parallel data programming, robust optimization of declarative data flow programs, continuous re-optimization and adaptation of the execution, data cleansing, and text mining. The unit will validate its work through a benchmark of the overall system performance and through demonstrators in the areas of climate research, the biosciences, and linked open data. The goal of Stratosphere is to jointly research and build a large-scale data processor based on concepts of robust and adaptive execution. We are researching a programming model that extends the functional map/reduce programming model with additional second-order functions.

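To make the programming-model idea concrete, the following minimal Java sketch contrasts the familiar map contract with a match-style contract that pairs records from two inputs by key. The class, method names, and signatures are invented for this illustration; they are not the actual Stratosphere API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// Simplified illustration of second-order functions beyond map/reduce.
// All names and signatures are invented for exposition; they are not
// the actual Stratosphere/PACT API.
public class SecondOrderSketch {

    // map: apply a user-defined first-order function to each record
    // independently; every record can be processed in parallel.
    static <I, O> List<O> map(List<I> input, Function<I, O> udf) {
        List<O> out = new ArrayList<>();
        for (I record : input) {
            out.add(udf.apply(record));
        }
        return out;
    }

    // match: an additional second-order function. It pairs each record of
    // the left input with every record of the right input sharing the same
    // key and invokes the user function once per pair. Plain map/reduce
    // offers no such parallelization contract over two inputs.
    static <K, L, R, O> List<O> match(Map<K, List<L>> left, Map<K, List<R>> right,
                                      BiFunction<L, R, O> udf) {
        List<O> out = new ArrayList<>();
        for (Map.Entry<K, List<L>> group : left.entrySet()) {
            List<R> partners = right.getOrDefault(group.getKey(), List.of());
            for (L l : group.getValue()) {
                for (R r : partners) {
                    out.add(udf.apply(l, r));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> orders = Map.of("c1", List.of("book"), "c2", List.of("pen"));
        Map<String, List<String>> names  = Map.of("c1", List.of("Alice"), "c2", List.of("Bob"));
        // A key-based join expressed via the match contract.
        System.out.println(match(orders, names, (item, name) -> name + " bought " + item));
    }
}
```

The point of such contracts is that the system, not the user code, decides how to partition and parallelize each input.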

As the execution platform, we use the Nephele system, a massively parallel data flow engine that is also researched and developed within the project. We are examining real-world use cases in the areas of climate research, information extraction and integration of unstructured data in the life sciences, as well as linked open data and social network graph data.

2.2 MIA

The German-language web consists of more than six billion web sites and is second in size only to the English-language web. This vast amount of data could potentially be used for a large number of applications, such as market and trend analysis, opinion and data mining for business intelligence, or applications in the domain of language processing technologies. The goal of MIA - A Marketplace for Trusted Information and Analysis - is to create a marketplace-like infrastructure in which this data is stored, refined, and made available in such a way that it enables trade with refined and agglomerated data and value-added services. To achieve this, we draw upon the results of our substantial research in the areas of cloud computing and information management. The marketplace provides the German-language web and its history as a data pool for analysis and value-added services. The focus of its initial version is on use cases in the domains of media, market research, and consulting. These use cases have special requirements regarding data privacy and security that will be observed. Gradually, the platform will be expanded to additional use cases and services, as well as toward internationalization. The proposed infrastructure enables new business models with information as a tradable good, building on algorithmic methods that extract information from semi-structured and unstructured data. By using the platform to collaboratively analyze and refine the data of the German-language web, businesses significantly reduce expenses while at the same time jointly creating the basis for a data economy. This will enable even small and medium-sized businesses to access and compete in this market.

2.3 GoOLAP.info

Today, the Web is one of the world's largest databases. However, due to its textual nature, aggregating and analyzing textual data from the Web in a manner analogous to a data warehouse is a hard problem. For instance, users may start from huge amounts of textual data and drill down into tiny sets of specific factual data, may manipulate or share atomic facts, and may repeat this process in an iterative fashion. In the GoOLAP - The Web as Data Warehouse project, we investigate fundamental problems in this process: What are common analysis operations of end users on natural-language Web text? What is the typical iterative process for generating, verifying, and sharing factual information from plain Web text? Can we integrate both the cloud, a cluster of massively parallel machines, and the crowd, the end users of GoOLAP.info, to solve hard problems such as training tens of thousands of fact extractors, verifying billions of atomic facts, or generating analytical reports from the Web? The current prototype, GoOLAP.info, already contains factual information from the Web about several million objects. The keyword-based query interface focuses on simple query intentions, such as "display everything about Airbus", as well as complex aggregation intentions, such as "list and compare mergers, acquisitions, competitors, and products of airplane technology vendors".

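As a rough picture of what such fact-centric querying involves, the sketch below stores extracted facts as (subject, relation, object) triples and answers a "display everything about X" keyword query by grouping matching facts per relation. The Fact record, the sample data, and the query logic are invented for illustration and are not GoOLAP's implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative fact store: facts extracted from Web text are kept as
// (subject, relation, object) triples, and a keyword query produces an
// OLAP-style view grouped by relation. Not GoOLAP's actual code.
public class FactQuerySketch {

    record Fact(String subject, String relation, String object) {}

    // "Display everything about <keyword>": select all facts mentioning
    // the keyword and group them by relation for drill-down display.
    static Map<String, List<Fact>> everythingAbout(List<Fact> facts, String keyword) {
        String k = keyword.toLowerCase();
        return facts.stream()
                .filter(f -> f.subject().toLowerCase().contains(k)
                          || f.object().toLowerCase().contains(k))
                .collect(Collectors.groupingBy(Fact::relation));
    }

    public static void main(String[] args) {
        List<Fact> facts = List.of(
                new Fact("Airbus", "competesWith", "Boeing"),
                new Fact("Airbus", "produces", "A380"),
                new Fact("Boeing", "produces", "737"));
        everythingAbout(facts, "airbus")
                .forEach((relation, group) -> System.out.println(relation + ": " + group));
    }
}
```
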
2.4 ROBUST

Online communities play a central role in vital business functions such as corporate expertise management, marketing, product support, and customer relationship management. Communities on the web easily grow to millions of users and thus need a scalable infrastructure capable of handling millions of discussion threads containing billions of posts. The EU integrated project ROBUST (Risk and Opportunity Management of huge-scale BUSiness communities) develops methods and models to monitor and understand the behavior and requirements of users and groups in these communities. A massively parallel cloud infrastructure will handle the processing and analysis of the community data.

Project partners such as SAP and IBM host communities for customer support on the Internet as well as communities for knowledge management in their intranets, which require highly scalable infrastructures for real-time data analysis. DIMA contributes to the areas of massively parallel processing of community data as well as community-based text analytics and information extraction.

2.5 SCAPE

The SCAPE (SCAlable Preservation Environments) project will develop scalable services for the planning and execution of institutional preservation strategies on an open-source platform that orchestrates semi-automated workflows for large-scale, heterogeneous collections of complex digital objects. These services will be able to:

- identify the need to act to preserve all or parts of a repository through characterisation and trend analysis;
- define responses to those needs using formal descriptions of preservation policies and preservation plans;
- allow a high degree of automation, virtualisation of tools, and scalable processing;
- monitor the quality of preservation processes.

The SCAPE consortium brings together experts from memory institutions, data centres, research labs, universities, and industrial firms in order to research and develop scalable preservation systems that can be practically deployed within the next three to five years. SCAPE is dedicated to producing open-source software solutions available to the entire digital preservation community. The project results will be curated and further exploited by the newly founded Open Planets Foundation. Project results will also be exploited by a small-to-medium enterprise and research institutions within the consortium catering to the preservation community, and by two large industrial IT partners.

2.6 BIZWARE

The DIMA group of TU Berlin is a research partner in the BMBF-funded regional business initiative BIZWARE, in which several industrial partners from Berlin, TU Berlin, and the Fraunhofer Institute FIRST work together to advance the long-term scientific and economic development of holistic model-based software development across the whole software lifecycle. In close collaboration with our industrial partners, we will develop the model and software factory as well as a runtime environment that allows modeling, generating, and running software components and applications based on domain-specific languages. The goal of the project is to provide innovative technology and methods to automate the phases of software development processes. Within the BIZWARE initiative, TU Berlin works on the sub-project "Lifecycle management for BIZWARE applications". The joint project will develop the infrastructure and tools to run, test, and configure applications that have been developed with the BIZWARE factory. Furthermore, the results of the project will enable monitoring of the applications in both a technical and a business manner and will provide an environment optimized for end users, test engineers, and software operators. The main focus of TU Berlin is software lifecycle management, which deals with the management of models, software artifacts, and components in dynamic repositories.

2.7 SINDPAD

Parallelization is becoming more and more important, even for the architecture of single machines. Recent advances in processor technologies achieve only small performance improvements for single cores. Increasing the compute power of modern architectures therefore mandates increasing the number of compute cores on a single central processing unit (CPU).

Graphics Processing Units (GPUs) have a long history of scale-out through parallel processing on many compute cores. Graphics adapters nowadays offer a highly parallel execution environment that, in the context of GPGPU (general-purpose computing on graphics processing units), is frequently used in scientific computing.

The challenge of GPGPU programming is to design applications for the SIMD (Single Instruction, Multiple Data) architecture of graphics adapters, which allows only a limited range of operators and very limited synchronization mechanisms. In the course of the SINDPAD project, we will develop indexing and search technology for structured data sets. We will leverage graphics adapters to support query execution. SINDPAD aims to achieve unprecedented performance compared to conventional systems of equal cost. We consider taking advantage of application characteristics to accelerate data processing. Especially for Business Intelligence (BI) applications, the schema enables the system to store specific data on graphics adapters, which can lead to further speedups. Researchers of the Database Systems and Information Management (DIMA) group at TU Berlin will play a significant role in the conceptual planning and implementation of algorithms for hybrid GPU/CPU processing. We will analyze query processing algorithms and devise metrics to compare the performance of GPU operators and CPU operators. The SINDPAD: Query Processing on GPUs project is funded by the German Federal Ministry of Economics and Technology and is carried out in cooperation with empulse GmbH.

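The kind of metric-driven decision such hybrid processing requires can be pictured with a small sketch: place each operator on the device whose estimated total cost, including data transfer to the graphics adapter, is lower. The cost model and all numbers below are invented for illustration and do not reflect SINDPAD's actual design.

```java
// Hypothetical placement decision for a hybrid GPU/CPU query processor:
// run an operator where the estimated total cost is lower. The cost
// model and numbers are invented; they are not SINDPAD's design.
public class OperatorPlacementSketch {

    enum Device { CPU, GPU }

    // Simplified cost model: the GPU pays an extra bus-transfer cost for
    // its input before the (usually faster) kernel can run.
    static Device place(double cpuTimeMs, double gpuKernelMs,
                        double inputBytes, double pcieBytesPerMs) {
        double gpuTotalMs = gpuKernelMs + inputBytes / pcieBytesPerMs;
        return gpuTotalMs < cpuTimeMs ? Device.GPU : Device.CPU;
    }

    public static void main(String[] args) {
        double inputBytes = 64 * 1024.0 * 1024.0;     // a 64 MiB input
        double pcieBytesPerMs = 8 * 1024.0 * 1024.0;  // assumed ~8 MiB per ms
        // The GPU kernel (15 ms) beats the CPU (20 ms), but 8 ms of input
        // transfer tips the decision back to the CPU: prints CPU.
        System.out.println(place(20.0, 15.0, inputBytes, pcieBytesPerMs));
    }
}
```
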
2.8 ELS

Increasingly, standards for railway systems require novel solutions for mainstream problems, such as the realization of optimal energy efficiency for complex control systems. For example, optimizing an ITCS (Intermodal Transport Control System) requires a centralized computer network system that reports on and evaluates a carrier's particular situation, enabling analysts to make informed decisions on problems of great interest. Achieving this objective would enable the reduction of traction energy demands. Among the basic components of an ITCS are a centralized computer system, a data communication system, and an on-board computer. There are numerous influential factors, such as the position of the vehicle and additional vehicular data (e.g., environmental impact, intermodal roadmap conditions), that must be considered at the design level to realize significant energy conservation. The evaluation of these influential factors involves real-time communication between the rail vehicle and the control station. The online system components comprise the control centre (ecoc), the underground vehicle (ecom), and data communication. A process-independent post-processing of the operating schedule will have to be ensured by an offline component in the control centre. The offline simulation processes and the mechanisms for analyzing the impact of simulation decisions are part of this offline component. For the transmission of essential data to the on-board computer in real time, an interface to the vehicle database will be defined. The system component ecom additionally contains a module that supports the train operator in predictable driving. All functions and programs are bundled in the ecoc manager to support a centrally coordinated, energy-optimal procedure for rail transport. The reduction of the work data used by the ITCS central station for situational analysis, the selection, storage, and further processing of work data, central optimization, the calculation of management decisions, and the administration of failure and management decision proposals will all have to be considered.

In the ELS - An Optimal Energy Control & Failure Management System - project, members of the DIMA group at TU Berlin will play a significant role in the conceptualization of a knowledge database for relevant operational scenarios, the identification and description of data streams, the construction of efficient renewal strategies in the event of failures, and the articulation of functional and technical specifications. Moreover, we will also be involved in implementing standardized interfaces for the transmission of ELS data and in performing integration tests. Additionally, interfaces for internal and third-party components will have to be carefully designed to meet specific conventions and to ensure the optimization of the control system.

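The factor evaluation described above can be pictured as a rule applied to a stream of vehicle status records. The sketch below is purely illustrative: the record fields, thresholds, and the coasting rule are invented assumptions, not part of the actual ELS design.

```java
import java.util.List;

// Illustrative on-line evaluation of vehicle status records: a simple
// rule combines position and timetable slack into a coasting
// recommendation to save traction energy. All fields, thresholds, and
// the rule itself are invented for this sketch.
public class EnergyAdviceSketch {

    record VehicleStatus(String vehicleId, double distanceToStopM,
                         double speedKmh, double scheduleSlackS) {}

    // Recommend coasting when the vehicle is close to the next stop and
    // has enough slack in the timetable to lose a little speed.
    static String advise(VehicleStatus s) {
        if (s.distanceToStopM() < 800 && s.scheduleSlackS() > 20 && s.speedKmh() > 30) {
            return s.vehicleId() + ": coast";      // cut traction power early
        }
        return s.vehicleId() + ": hold speed";
    }

    public static void main(String[] args) {
        List<VehicleStatus> stream = List.of(
                new VehicleStatus("U2-17", 650, 45, 35),   // close, slack -> coast
                new VehicleStatus("U2-18", 2400, 50, 5));  // far, no slack -> hold
        stream.forEach(s -> System.out.println(advise(s)));
    }
}
```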

3 Teaching

At TU Berlin, we strive to combine teaching with research and practical settings. Undergraduate and graduate coursework offerings include the usage and implementation of database systems, information modeling, and information integration. In addition to standard database classes, we offer many interesting student projects (combining lectures with hands-on practical exercises) in the areas of data warehousing and business intelligence as well as large-scale data analytics and data mining. Our courses cover current research trends, novel developments, and research results. For practical exercises, we use both commercial systems and open-source software (e.g., Apache Hadoop and Mahout). The lectures, seminars, and projects offered at DIMA all aim to educate students not only in technology and theory, but also to foster social skills with respect to teamwork, project management, and leadership, as well as business acumen. Theoretical lectures are accompanied by practical lab courses and exercises, where students learn to work in teams and jointly find solutions to larger problems. We also give students the opportunity to apply the skills learned in our courses in practical settings. Because we believe this to be very important, we regularly offer research and teaching assistant positions for both graduate and doctoral students, and we help place students in industrial internships with leading international companies.

4 Further Information

Further information on teaching and research can be found on the web pages of the DIMA institute at www.dima.tu-berlin.de.