The Database Systems and Information Management Group at Technische Universität Berlin

1 Introduction

The Database Systems and Information Management Group, known in German by the acronym DIMA, is part of the Department of Software Engineering and Theoretical Computer Science at TU Berlin. It is led by Prof. Dr. Volker Markl and consists of 3 postdocs, 8 research associates, and 19 student assistants.

2 Research Areas

The Database Systems and Information Management (DIMA) group, under the direction of Volker Markl, conducts research in the areas of information modeling, business intelligence, query processing, query optimization, the impact of new hardware architectures on information management, and applications. While maintaining a strong focus on system building and on validating research in practical scenarios and use cases, the group aims to explore and provide fundamental, theoretically sound solutions to major current research challenges. The group interacts closely with researchers at prestigious national and international academic institutions and carries out joint research projects with leading IT companies, including Hewlett-Packard, IBM, and SAP, as well as with innovative small and medium enterprises. In the following paragraphs, we present our main research projects.

2.1 Stratosphere

Our flagship project is a Collaborative Research Unit funded by the Deutsche Forschungsgemeinschaft (DFG), in which Technische Universität Berlin, Humboldt-Universität zu Berlin, and the Hasso-Plattner-Institut in Potsdam jointly research Information Management on the Cloud. Stratosphere aims to considerably advance the state of the art in data processing on parallel, adaptive architectures. Stratosphere (named after the layer of the atmosphere above the clouds) explores the power of massively parallel computing for complex information management applications.
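As described below, Stratosphere researches a programming model that extends functional map/reduce with additional second-order functions (the published PACT model, for instance, adds contracts such as cross, match, and cogroup). The following is a purely illustrative, sequential sketch of the idea, not the project's actual API; all function names here are hypothetical:

```python
# Illustrative sketch of second-order functions in a map/reduce-style model.
# Names and signatures are hypothetical, not Stratosphere's actual API.

from collections import defaultdict

def pact_map(udf, records):
    """Second-order 'map': applies the first-order udf to each record independently."""
    out = []
    for r in records:
        out.extend(udf(r))
    return out

def pact_reduce(udf, records):
    """Second-order 'reduce': groups (key, value) records by key and applies udf per group."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    out = []
    for key, values in groups.items():
        out.extend(udf(key, values))
    return out

def pact_match(udf, left, right):
    """Second-order 'match': pairs records from two inputs that share a key
    (an equi-join-like contract) and applies udf to each pair."""
    index = defaultdict(list)
    for key, value in left:
        index[key].append(value)
    out = []
    for key, value in right:
        for left_value in index[key]:
            out.extend(udf(key, left_value, value))
    return out

# Usage: join page visits with page titles by URL.
visits = [("a.html", "user1"), ("b.html", "user2"), ("a.html", "user3")]
titles = [("a.html", "Home"), ("b.html", "About")]
joined = pact_match(lambda k, title, user: [(k, title, user)], titles, visits)
print(sorted(joined))
# [('a.html', 'Home', 'user1'), ('a.html', 'Home', 'user3'), ('b.html', 'About', 'user2')]
```

In the actual system, each second-order function fixes how records may be partitioned across the parallel instances of a user function, which is what lets the optimizer and the execution engine parallelize declarative data flow programs.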
Building on the expertise of the participating researchers, we aim to develop a novel, database-inspired approach to analyzing, aggregating, and querying very large collections of textual and (semi-)structured data on a virtualized, massively parallel cluster architecture. Stratosphere conducts research in the areas of massively parallel data processing engines, a programming model for parallel data programming, robust optimization of declarative data flow programs, continuous re-optimization and adaptation of the execution, data cleansing, and text mining. The unit will validate its work through a benchmark of the overall system performance and through demonstrators in the areas of climate research, the biosciences, and linked open data. The goal of Stratosphere is to jointly research and build a large-scale data processor based on concepts of robust and adaptive execution. We are researching a programming model that extends the functional map/reduce programming model with additional second-order functions. As execution platform we use the Nephele system, a massively parallel data flow engine that is also researched and developed within the project. We are examining real-world use cases in the areas of climate research, information extraction and integration of unstructured data in the life sciences, as well as linked open data and social network graph data.

2.2 MIA

The German-language web consists of more than six billion web sites and is second in size only to the English-language web. This vast amount of data could potentially be used for a large number of applications, such as market and trend analysis, opinion and data mining for business intelligence, or applications in the domain of language processing technologies. The goal of MIA (A Marketplace for Trusted Information and Analysis) is to create a marketplace-like infrastructure in which this data is stored, refined, and made available in such a way that it enables trade with refined and agglomerated data and value-added services. To achieve this, we draw upon the results of our substantial research in the areas of cloud computing and information management. The marketplace provides the German-language web and its history as a data pool for analysis and value-added services. The focus of its initial version is on use cases in the domains of media, market research, and consulting. These use cases have special data privacy and security requirements that will be observed. Gradually, the platform will be expanded to additional use cases and services, as well as internationalized. The proposed infrastructure enables new business models with information as a tradable good, building on algorithmic methods that extract information from semi-structured and unstructured data. By using the platform to collaboratively analyze and refine the data of the German-language web, businesses significantly reduce expenses while at the same time jointly creating the basis for a data economy.
This will enable even small and medium-sized businesses to access and compete in this market.

2.3 GoOLAP.info

Today, the Web is one of the world's largest databases. However, due to its textual nature, aggregating and analyzing textual data from the Web analogously to a data warehouse is a hard problem. For instance, users may start from huge amounts of textual data and drill down into tiny sets of specific factual data, may manipulate or share atomic facts, and may repeat this process in an iterative fashion. In the GoOLAP (The Web as Data Warehouse) project we investigate fundamental problems in this process: What are common analysis operations of end users on natural-language Web text? What is the typical iterative process for generating, verifying, and sharing factual information from plain Web text? Can we integrate both the cloud, a cluster of massively parallel machines, and the crowd, the end users of GoOLAP.info, to solve hard problems such as training tens of thousands of fact extractors, verifying billions of atomic facts, or generating analytical reports from the Web? The current prototype GoOLAP.info already contains factual information from the Web for several million objects. The keyword-based query interface focuses on simple query intentions, such as "display everything about Airbus", as well as complex aggregation intentions, such as "list and compare mergers, acquisitions, competitors, and products of airplane technology vendors".

2.4 ROBUST

Online communities play a central role in vital business functions such as corporate expertise management, marketing, product support, and customer relationship management. Communities on the web easily grow to millions of users and thus need a scalable infrastructure capable of handling millions of discussion threads containing billions of posts.
The EU integrated project ROBUST (Risk and Opportunity Management of huge-scale BUSiness communities) develops methods and models to monitor and understand the behavior and requirements of users and groups in these communities. A massively parallel cloud infrastructure will handle the processing and analysis of the community data. Project partners such as SAP and IBM host communities for customer support on the Internet as well as communities for knowledge management on their intranets, which require highly scalable infrastructures for real-time data analysis. DIMA contributes to the areas of massively parallel processing of community data as well as community-based text analytics and information extraction.

2.5 SCAPE

The SCAPE (SCAlable Preservation Environments) project will develop scalable services for the planning and execution of institutional preservation strategies on an open-source platform that orchestrates semi-automated workflows for large-scale, heterogeneous collections of complex digital objects. These services will be able to:
- identify the need to act to preserve all or parts of a repository through characterisation and trend analysis;
- define responses to those needs using formal descriptions of preservation policies and preservation plans;
- allow a high degree of automation, virtualisation of tools, and scalable processing;
- monitor the quality of preservation processes.
The SCAPE consortium brings together experts from memory institutions, data centres, research labs, universities, and industrial firms in order to research and develop scalable preservation systems that can be practically deployed within the next three to five years. SCAPE is dedicated to producing open-source software solutions available to the entire digital preservation community. The project results will be curated and further exploited by the newly founded Open Planets Foundation. Project results will also be exploited by a small-to-medium enterprise and research institutions within the consortium catering to the preservation community, as well as by two large industrial IT partners.
2.6 BIZWARE

The Database Systems and Information Management Group (DIMA) of TU Berlin is a research partner in the BMBF-funded regional business initiative BIZWARE, in which several industrial partners from Berlin, TU Berlin, and the Fraunhofer Institute FIRST work together to advance the long-term scientific and economic development of holistic model-based software development covering the whole software lifecycle. In close collaboration with our industrial partners, we will develop the model and software factory and a runtime environment that allows modeling, generating, and running software components and applications based on domain-specific languages. The goal of the project is to provide innovative technology and methods to automate the phases of software development processes. Within the BIZWARE initiative, TU Berlin works on the sub-project "Lifecycle management for BIZWARE applications". The joint project will develop the infrastructure and tools to run, test, and configure applications that have been developed with the BIZWARE factory. Furthermore, the results of the project will enable monitoring of the applications in both a technical and a business manner and provide an environment optimized for end users, test engineers, and software operators. The main focus of TU Berlin is software lifecycle management, which deals with the management of models, software artifacts, and components in dynamic repositories.

2.7 SINDPAD

Parallelization is becoming more and more important, even within the architecture of single machines. Recent advances in processor technologies achieve only small performance improvements for single cores; increasing the compute power of modern architectures therefore mandates increasing the number of compute cores on a single central processing unit (CPU). Graphics Processing Units (GPUs) have a long history of scale-out through parallel processing on many compute cores. Graphics adapters nowadays offer a highly parallel execution environment that, within the context of GPGPU (General-Purpose computing on Graphics Processing Units), is frequently used in scientific computing. The challenge of GPGPU programming is to design applications for the SIMD architecture (Single Instruction, Multiple Data) of graphics adapters, which allows only a limited range of operators and very limited synchronization mechanisms. In the course of the SINDPAD project, we will develop an indexing and search technology for structured data sets. We will leverage graphics adapters to support query execution. SINDPAD aims at achieving unprecedented performance compared to conventional systems of equal cost. We consider taking advantage of application characteristics to accelerate data processing. Especially for Business Intelligence (BI) applications, the schema enables the system to store specific data on graphics adapters, which can lead to further speed-ups. Researchers of the Database Systems and Information Management (DIMA) group at TU Berlin will play a significant role in the conceptual planning and implementation of algorithms for hybrid GPU/CPU processing. We will analyze query processing algorithms and devise metrics to compare the performance of GPU operators and CPU operators. The SINDPAD (Query Processing on GPUs) project is funded by the German Federal Ministry of Economics and Technology and is carried out in cooperation with empulse GmbH.

2.8 ELS

Increasingly, standards for railway systems require novel solutions to mainstream problems, such as the realization of optimal energy efficiency for complex control systems. For example, optimizing an ITCS (Intermodal Transport Control System) requires a centralized computer network system that reports and evaluates a carrier's particular situation, enabling analysts to make informed decisions on problems of great interest. Achieving this objective would enable the reduction of traction-energy demands.
Among the basic components of an ITCS are a centralized computer system, a data communication system, and an on-board computer. There are numerous influential factors, such as the position of the vehicle and additional vehicular data (e.g., environmental impact, intermodal roadmap conditions, etc.), which must be considered at the design level to realize significant energy conservation. The evaluation of these influential factors involves real-time communication between the rail vehicle and the control station. The online system components comprise the control centre (ecoc), the underground vehicle (ecom), and data communication. A process-independent post-processing of the operating schedule will be ensured by an offline component in the control centre; the offline simulation processes and the mechanisms for analyzing the impact of simulation decisions are part of this offline component. For the transmission of essential data to the on-board computer in real time, an interface to the vehicle database will be defined. The system component ecom additionally contains a module that supports the train operator in predictable driving. All functions and programs are bundled and stored in the ecoc manager to support a central, energy-optimal procedure for rail transport. Further aspects to be considered include the reduction of the work data used by the ITCS central station for situational analysis; the selection, storage, and further processing of work data; central optimization; the calculation of management decisions; and the administration of failure and management decision proposals.
In the ELS (An Optimal Energy Control & Failure Management System) project, members of the DIMA group at TU Berlin will play a significant role in the conceptualization of a knowledge database for relevant operational scenarios, the identification and description of data streams, the construction of efficient renewal strategies in the event of failures, and the articulation of functional and technical specifications. Moreover, we will also be involved in the implementation of standardized interfaces for the transmission of ELS data and in performing integration tests. Additionally, interfaces for internal and third-party components will have to be carefully designed to meet specific conventions and to ensure the optimization of the control system.

3 Teaching

At TU Berlin, we strive to combine teaching with research and practical settings. Undergraduate and graduate coursework offerings include the usage and implementation of database systems, information modeling, and information integration. In addition to standard database classes, we offer many interesting student projects (combining lectures with hands-on practical exercises) in the areas of data warehousing and business intelligence, as well as large-scale data analytics and data mining. Our courses cover current research trends, novel developments, and research results. For practical exercises, we use both commercial systems and open-source software (e.g., Apache Hadoop and Mahout). The lectures, seminars, and projects offered at DIMA all aim to educate students not only in technology and theory, but also to foster social skills with respect to teamwork, project management, and leadership, as well as business acumen. Theoretical lectures are accompanied by practical lab courses and exercises, where students learn to work in teams and jointly find solutions to larger problems. We also give students the opportunity to apply the skills learned in our courses in practical settings. Because we believe this to be very important, we regularly offer research and teaching assistant positions for both graduate and doctoral students, and we help place students in industrial internships with leading international companies.

4 Further Information

Further information on teaching and research can be found on the web pages of the DIMA institute at www.dima.tu-berlin.de.