Analysis Pipelines for Benchmarking Big Data Systems

1 Analysis Pipelines for Benchmarking Big Data Systems
Thomas Bodner
Ref. Code: Berlin_EN_2182, Suggested starting date: May 15, 2013

Today, practically everyone, from web companies and traditional enterprises to natural science researchers and anthropologists, is either already experiencing or at least anticipating challenges due to two ongoing trends. First, the data that is acquired, stored, and analyzed to gain insight and value is becoming increasingly voluminous, diverse, and time-critical, which led to the coining of the term Big Data. Second, the depth of analytics over this Big Data is steadily increasing to drive processes such as spam detection and advertisement placement.

Within the past few years, industrial and academic organizations have designed a wealth of new systems to cope with these challenges. These include MapReduce [1], SCOPE [2], ASTERIX [3], and Stratosphere [4], the last of which is being developed in our group. Right now, there is no generally accepted way to benchmark such systems. In our previous work, we implemented Myriad [5], a toolkit for the creation of parallel data generators, and Big Darwin, a repository of workloads from different domains in this space, e.g. machine learning and graph analysis.

As a next step towards standardized benchmarks in this field, we plan to conceptualize and implement representative data analysis pipelines as they are commonly executed by Big Data users such as Walmart, Google, or CERN. Such pipelines consist of related jobs that may perform data transformations, relational algebra operations, or multi-stage machine learning and statistical algorithms. In this project, we will specify and implement analytic pipelines for the massively parallel data analysis system Stratosphere, and quantitatively compare it to Big Data stacks developed elsewhere (e.g., Spark from UC Berkeley or ASTERIX from UC Irvine) by running the resulting pipelines in a cluster or cloud setting such as Amazon EC2.
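To make the idea of a pipeline of related jobs more concrete, the following is a minimal single-machine sketch in Python. It is not Stratosphere code: the stage functions (`parse`, `count_per_user`, `top_users`), the harness `run_pipeline`, and the sample records are all hypothetical names and data chosen for illustration. It chains a data transformation, a relational-style aggregation, and a final analytic step, and records the wall-clock time of each stage.

```python
import time
from collections import Counter

def parse(lines):
    # Data transformation: split "user,item" records into tuples.
    return [tuple(line.split(",")) for line in lines]

def count_per_user(records):
    # Relational-style aggregation: GROUP BY user, COUNT(*).
    return Counter(user for user, _ in records)

def top_users(counts, k=2):
    # Final analytic step: the k most active users.
    return [user for user, _ in counts.most_common(k)]

def run_pipeline(stages, data):
    """Run the stages in order, recording wall-clock seconds per stage."""
    timings = {}
    for stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[stage.__name__] = time.perf_counter() - start
    return data, timings

# Hypothetical toy input; a real benchmark would use generated data at scale.
raw = ["ann,book", "bob,pen", "ann,lamp", "cy,pen", "ann,pen", "bob,cup"]
result, timings = run_pipeline([parse, count_per_user, top_users], raw)
print(result, timings)
```

A real benchmark would replace each stage with a job on the system under test and compare the per-stage and end-to-end timings across systems.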
Candidates for this project should be interested in MapReduce, database internals, and machine learning, and should have good programming skills in Java.

[1] Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
[2] Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Åke Larson, Ronnie Chaiken, Darren Shakib: SCOPE: parallel databases meet MapReduce. VLDB J. 21(5) (2012).
[3] Alexander Behm, Vinayak R. Borkar, Michael J. Carey, Raman Grover, Chen Li, Nicola Onose, Rares Vernica, Alin Deutsch, Yannis Papakonstantinou, Vassilis J. Tsotras: ASTERIX: towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases 29(3) (2011).
[4] Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, Daniel Warneke: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. SoCC 2010.
[5] Alexander S. Alexandrov, Berni Schiefer, John Poelman, Stephan Ewen, Thomas Bodner, Volker Markl: Myriad - Parallel Data Generation on Shared-Nothing Architectures. Proceedings of the First Workshop on Architectures and Systems for Big Data (ASBD), 2011.

2 Implementing a DSL for Machine Learning on top of a Massively-Parallel Data Processing System
Alexander Alexandrov, alexander.alexandrov@tu-berlin.de
Ref. Code: Berlin_EN_2200, Suggested starting date: May 15, 2013

The boom of digital data in the past decades has triggered unprecedented interest in disciplines that provide tools to discover and extract useful patterns from the expanding sea of information. Machine Learning is probably the most prominent example of such a discipline. A Machine Learning cycle consists of a learning phase, where the parameters of a statistical model are fitted to training data, and a discovery phase, where the trained model is used to predict features of new, unseen data. Empirical evidence shows that the quality of many Machine Learning algorithms improves with the size of the training data, and that a simple algorithm is likely to outperform a more complex one if it gets more training data [1]. Modern Machine Learning applications should therefore be able to scale to very large sets of training data.

In the past few years, several projects have demonstrated that general-purpose parallel computation models like Google's MapReduce (and its open-source implementation Hadoop) [2,3] can be used to implement highly scalable, fault-tolerant machine learning algorithms [4,5]. The Stratosphere system [6] provides an extension of the MapReduce programming model. In Stratosphere, data processing pipelines are specified as directed acyclic graphs. The nodes of the graph are higher-order functions called parallelization contracts (PACTs), e.g. Map and Reduce, which are parameterized with user-defined functions (UDFs) that handle the actual computation.

We are currently building a new Stratosphere optimizer that will allow efficient specification of Domain Specific Languages (DSLs) on top of the PACT programming model. The goal of this project is to implement a small DSL for Machine Learning on top of the Stratosphere optimizer.
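The notion of a higher-order function parameterized with a UDF can be illustrated with a small, single-machine Python sketch. This is not the Stratosphere API: `map_contract`, `reduce_contract`, and the two UDFs are hypothetical names, and the sketch ignores parallel execution entirely. It only shows how the contract fixes the way records are handed to the UDF while the UDF supplies the computation.

```python
from collections import defaultdict

def map_contract(udf, records):
    """Call the UDF on every record independently; each call may emit 0..n outputs."""
    return [out for record in records for out in udf(record)]

def reduce_contract(udf, pairs):
    """Group (key, value) pairs by key and call the UDF once per group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [udf(key, values) for key, values in groups.items()]

# User-defined functions that carry the actual computation (word count).
def tokenize(line):
    return [(word, 1) for word in line.lower().split()]

def sum_counts(word, ones):
    return (word, sum(ones))

lines = ["big data systems", "big data benchmarks"]
counts = dict(reduce_contract(sum_counts, map_contract(tokenize, lines)))
print(counts)
```

Because the Map contract promises that records are processed independently and the Reduce contract promises that records are grouped by key, a system is free to partition the input across machines without looking inside the UDFs.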
In the course of the project, you will (A) specify an algebra of suitable operators for the Machine Learning DSL (e.g. matrix multiplication, matrix decomposition), (B) provide PACT implementations for the algebra, and (C) find algebraic equivalences that the optimizer can use for rewriting. Candidates for this project should be interested in MapReduce, database query optimization, and linear algebra, and should have good programming skills in Java.

[1] A. Halevy, P. Norvig, F. Pereira: "The Unreasonable Effectiveness of Data". IEEE Intelligent Systems, pp. 8-12, March/April 2009.
[2] J. Dean, S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
[3] Apache Hadoop.
[4] A. Ghoting, R. Krishnamurthy, E. P. D. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, S. Vaithyanathan: SystemML: Declarative machine learning on MapReduce. ICDE 2011.
[5] Apache Mahout: Scalable machine learning and data mining.
[6] Stratosphere: Above the Clouds.
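As an illustration of the kind of algebraic equivalence mentioned in task (C), consider the associativity of matrix multiplication: (A·B)·v = A·(B·v). Both sides give the same result, but their costs differ. The Python sketch below uses a hypothetical cost model that simply counts scalar multiplications; a real optimizer would use its own cost estimates, and the helper names `matmul` and `cost` are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cost(shape_a, shape_b):
    """Scalar multiplications for an (n x m) times (m x p) product."""
    n, m = shape_a
    _, p = shape_b
    return n * m * p

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
v = [[5], [6]]

# Both evaluation orders produce the same vector.
left = matmul(matmul(A, B), v)    # (A·B)·v
right = matmul(A, matmul(B, v))   # A·(B·v)

# For n x n matrices and an n x 1 vector the costs differ by a factor of about n/2.
n = 1000
cost_left = cost((n, n), (n, n)) + cost((n, n), (n, 1))
cost_right = cost((n, n), (n, 1)) + cost((n, n), (n, 1))
print(left, right, cost_left, cost_right)
```

An optimizer that knows this equivalence can rewrite the first form into the second whenever its cost estimate is lower.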

3 GPU-assisted model optimization in PostgreSQL
Max Heimel
Ref. Code: Berlin_EN_2221, Suggested starting date: June 01, 2013

In Machine Learning, statistical models are fitted (or trained) to observations. A trained model can be used to predict features of unseen data or to discover new aspects of observed data. Specific model parameters are typically learned using gradient-based numerical optimization algorithms like Gradient Descent or Quasi-Newton methods: given the gradient of an objective function, these methods iteratively move towards a (locally) optimal set of parameters that fit the model to the given training data. The quality of Machine Learning algorithms generally improves with the available amount of training data [1].

Today, models are usually trained using statistical or numerical systems such as R or Matlab. When the training data is kept in a database, this approach requires data transfer between different systems, a process that does not scale well to large data volumes. To avoid this transfer, it is helpful to push model training directly into the database kernel [2]. While this approach reduces the need to import and export data, it also adds nontrivial computational load to the database. To alleviate this problem, we propose to use graphics cards as "model-training co-processors" for the database kernel [3]. Graphics cards are well suited to accelerate the computations encountered in model training [4] and can therefore help to reduce the additional load that is put on the database.

The goal of this project is to implement gradient-based optimization routines on a graphics card, integrate them into the open-source database engine PostgreSQL [5], and evaluate the performance of this integrated optimization module. You should be interested in machine learning, numerical optimization, and parallel programming on graphics cards using OpenCL [6], and have at least basic programming experience in C/C++ (or Java).

[1] Halevy, A., Norvig, P., & Pereira, F. (2009). The unreasonable effectiveness of data. Intelligent Systems, IEEE, 24(2).
[2] Hellerstein, J. M., Ré, C., Schoppmann, F., Wang, D. Z., Fratkin, E., Gorajek, A., ... & Kumar, A. (2012). The MADlib analytics library: or MAD skills, the SQL. Proceedings of the VLDB Endowment, 5(12).
[3] Heimel, M., & Markl, V. (2012). A First Step Towards GPU-assisted Query Optimization.
[4] Steinkraus, D., Buck, I., & Simard, P. Y. (2005, August). Using GPUs for machine learning algorithms. In Proceedings of the Eighth International Conference on Document Analysis and Recognition. IEEE.
[5] PostgreSQL.
[6] OpenCL.
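The gradient-based model training described in this project can be sketched in a few lines of plain Python. The example below is a CPU-only illustration, not the GPU or PostgreSQL implementation the project aims for. It fits a one-dimensional linear model y = w·x + b by Gradient Descent on the mean squared error. The learning rate, step count, and toy data are assumptions chosen so that the iteration converges.

```python
def gradient(w, b, xs, ys):
    """Gradient of the mean squared error of the model y = w*x + b."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

def gradient_descent(xs, ys, rate=0.05, steps=5000):
    """Repeatedly step against the gradient, starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = gradient(w, b, xs, ys)
        w -= rate * dw
        b -= rate * db
    return w, b

# Toy training data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(w, b)
```

The two sums in `gradient` are independent per training example, which is the kind of data-parallel computation that maps well onto a graphics card.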

4 Using Compiler Techniques for Database Query Optimization
Marcus Leich, Kostas Tzoumas
Ref. Code: Berlin_EN_2408, Suggested starting date: May 15, 2013

In recent years, we have seen a boom in data availability, coupled with a plunging cost of storage and computing. These trends, accelerated by the opportunity of cloud computing, are delivering the "Big Data Analytics" promise: the ability to analyze data of unprecedented volume and variety. This data is usually heterogeneous, ranging from structured enterprise data to unstructured Web and text data and graph data from social networks. Traditional database systems were designed for tightly structured data and are not a good match for such variety. In addition, the volume of such data necessitates the use of massively distributed computation platforms.

In reaction to these trends, a new breed of parallel data processing systems has emerged, such as MapReduce/Hadoop [3], ASTERIX [2], SCOPE [6], and Stratosphere [1], the last of which is being developed in our group at TU Berlin. These systems typically have at their core a flexible functional dataflow model that gives the programmer many degrees of freedom. Typically, programmers are free to write the data analysis code in a full-fledged imperative language such as Java. This poses challenges to traditional query optimizer designs, which typically assume an algebraic query language.

In our previous work, we provided some building blocks for analyzing user-defined code using program analysis, and for using the outcomes of this analysis to emulate transformations done by traditional query optimizers without knowing the full semantics of the code [5]. The SCOPE system at Microsoft has also arrived at similar techniques [4]. The goal of this project is to go beyond traditional query optimizers by taking a holistic view of program optimization that includes both compiler and database query optimization.
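One way to picture how program analysis can stand in for full semantics is a conflict test on the fields that each user-defined function reads and writes. The Python sketch below is a deliberately simplified illustration and not necessarily the conditions used in [5]: the read and write sets are declared by hand here (a real system would derive them by static analysis), the operators are assumed to process one record at a time, and all names are hypothetical.

```python
def can_reorder(op_a, op_b):
    """Simplified test: two record-at-a-time operators may be swapped if
    neither writes a field that the other reads or writes."""
    return (not (op_a["writes"] & (op_b["reads"] | op_b["writes"]))
            and not (op_b["writes"] & op_a["reads"]))

# UDFs whose bodies the optimizer treats as black boxes.
def add_total(rec):
    return {**rec, "total": rec["price"] * rec["qty"]}

def is_german(rec):
    return rec["country"] == "DE"

# Hand-declared read/write sets standing in for the result of code analysis.
ADD_TOTAL = {"reads": {"price", "qty"}, "writes": {"total"}}
FILTER_DE = {"reads": {"country"}, "writes": set()}
FILTER_BIG = {"reads": {"total"}, "writes": set()}  # hypothetical filter on "total"

records = [
    {"country": "DE", "price": 2, "qty": 3},
    {"country": "US", "price": 5, "qty": 1},
]

# The map and the country filter touch disjoint fields, so both orders agree.
map_then_filter = [r for r in map(add_total, records) if is_german(r)]
filter_then_map = [add_total(r) for r in records if is_german(r)]
print(can_reorder(ADD_TOTAL, FILTER_DE), can_reorder(ADD_TOTAL, FILTER_BIG))
```

Pushing the country filter below the map means `add_total` runs on fewer records, which mirrors the classic selection push-down of relational optimizers. The filter on `total` cannot be moved because it reads a field the map writes.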
We will investigate (1) intrusive techniques that split user code into pieces and optimize the pieces independently, and (2) extracting the semantics of user code in order to estimate resource consumption. The ideal candidate should have experience with Java programming, database internals (query optimization), compilers, and program optimization. Experience with the Soot framework for program analysis is a plus.

[1] Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, Daniel Warneke: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. SoCC 2010.
[2] Alexander Behm, Vinayak R. Borkar, Michael J. Carey, Raman Grover, Chen Li, Nicola Onose, Rares Vernica, Alin Deutsch, Yannis Papakonstantinou, Vassilis J. Tsotras: ASTERIX: towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases 29(3) (2011).
[3] Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
[4] Zhenyu Guo, Xuepeng Fan, Rishan Chen, Jiaxing Zhang, Hucheng Zhou, Sean McDirmid, Chang Liu, Wei Lin, Jingren Zhou, Lidong Zhou: Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE. OSDI 2012.
[5] Fabian Hueske, Mathias Peters, Matthias Sax, Astrid Rheinländer, Rico Bergmann, Aljoscha Krettek, Kostas Tzoumas: Opening the Black Boxes in Data Flow Optimization. PVLDB 5(11) (2012).
[6] Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Åke Larson, Ronnie Chaiken, Darren Shakib: SCOPE: parallel databases meet MapReduce. VLDB J. 21(5) (2012).

5 Large Scale Data Mining with Iterative Data Flows
Sebastian Schelter
Ref. Code: BerlinGermany_EN_2414, Suggested starting date: May 16, 2013

During the last decade, the cost of acquiring and storing huge amounts of data has dropped significantly. As the technologies to process and analyze these datasets are being developed, we face previously unimaginable new possibilities. Scientists can test hypotheses on data several orders of magnitude larger than before, and formulate hypotheses by exploring large data sets. The technology that enables such analysis is rooted in the fields of database systems, information management, machine learning, and distributed systems.

While the first generation of parallel processing platforms focused on aggregation and relational queries, there is a growing demand for applying more complex analytics. The tools to derive these come in the form of complex mathematical algorithms from the fields of machine learning, information extraction, and graph mining. Most of these algorithms apply mathematical functions to the data iteratively until convergence. Enormous amounts of training data are used to successively refine a mathematical prediction model. It is desirable to handle iterative algorithms in general, multi-purpose distributed data flow systems.

The Stratosphere system [1] is a database-inspired parallel processing stack for Big Data analytics. The system is part of a large-scale research project led by our group at TU Berlin. We are currently investigating the efficient execution of iterative data flows [2]. These data flows can be composed of second-order functions, such as the commonly known map and reduce, but also of more advanced functions, e.g. for grouping and matching several inputs.
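The pattern of applying a function iteratively until convergence can be illustrated with PageRank, a well-known algorithm from graph mining. The following single-machine Python sketch is not Stratosphere code. The damping factor, tolerance, and three-node graph are assumptions for illustration, and the sketch assumes that every node has at least one outgoing link and that every link target is itself a key of `links`.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Repeat the rank update until the total change drops below tol."""
    nodes = list(links)
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for iteration in range(1, max_iter + 1):
        # Every node starts the round with the "random jump" share.
        new_ranks = {n: (1.0 - damping) / len(nodes) for n in nodes}
        # Each node distributes its damped rank evenly over its out-links.
        for src, targets in links.items():
            share = damping * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        delta = sum(abs(new_ranks[n] - ranks[n]) for n in nodes)
        ranks = new_ranks
        if delta < tol:  # convergence criterion
            break
    return ranks, iteration

# Hypothetical toy graph: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks, iterations = pagerank(links)
print(ranks, iterations)
```

Each round consists of a per-edge "send share" step and a per-node "sum shares" step, which correspond naturally to map-like and reduce-like operators inside a loop. In a data flow system, the loop itself becomes part of the data flow.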
The goal of this project is to implement several data mining algorithms from the fields of social network analysis [3] and collaborative filtering [4] using Stratosphere's iterative data flows. The implementations have to be benchmarked on several public datasets. Furthermore, we will investigate selected aspects of the compilation, optimization, and fault tolerance of iterative data flows.

[1] Alexander Alexandrov, Dominic Battré, Stephan Ewen, Max Heimel, Fabian Hueske, Odej Kao, Volker Markl, Erik Nijkamp, Daniel Warneke: Massively Parallel Data Analysis with PACTs on Nephele. VLDB, 2010.
[2] Stephan Ewen, Moritz Kaufmann, Kostas Tzoumas: Spinning Fast Iterative Data Flows. VLDB, 2012.
[3] U. Kang, C. E. Tsourakakis, C. Faloutsos: PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. ICDM, 2009.
[4] Yehuda Koren, Robert M. Bell, Chris Volinsky: Matrix Factorization Techniques for Recommender Systems. IEEE Computer, 2009.

6 Conditions and Application

Terms:
- Monthly scholarship of EUR 650
- Allowance of EUR 160 for a Rail Pass
- Health insurance as well as accident and personal liability insurance
- Flights to and from Germany are not covered
- Starting dates may be adjusted to your needs, but have to be between May 15 and early July
- All projects have a duration of 2-3 months

Requirements:
1. You are currently enrolled at a university/college in the United States, Canada, or the UK as a full-time student in the field of biology, chemistry, physics, earth sciences, or engineering (or a closely related field).
2. You are an undergraduate who will have completed at least two years of a degree program by the time of the placement.
3. Proof that you will be registered as an undergraduate at your university/college for the academic year 2013/2014 (i.e., that you will still have undergraduate status upon your return to your home university).
4. Visa requirements:
   - British citizens: no visa required
   - U.S. and Canadian citizens: no visa required if you stay 90 days or less
   - Others: please check with the German Federal Foreign Office

Application Process:
The full application process is described here:
- Visit the RISE website
- Register at the website and log in
- Search the database for topics that suit your interests
- Prepare and submit your application online

Required documents for your application:
1. Your completed application form
2. Full curriculum vitae/résumé
3. Up-to-date official university/college transcript(s)
4. As a supplement to this transcript, a list of courses you will have completed by the time the internship begins
5. A letter of reference from a senior academic in your field of study from the university you are enrolled at
6. A cover letter in which you state your motivation to apply for this particular project (or projects)
7. A certificate of enrollment (after having been accepted for a scholarship, you have to submit this document as a hard copy as well)

Application deadline is January 31, 2013.

7 Why TU Berlin?

Berlin:
- Capital of Germany, culturally vibrant city
- Politics, arts, history, entertainment, nightlife: it's all here

TU Berlin is one of the largest science universities in Germany and home to excellent researchers:
- 12 Nobel Prize winners
- Wernher von Braun, designed the rocket that brought mankind to the moon
- Dennis Gábor, Nobel Prize for the invention of holography
- Ernst Ruska, Nobel Prize for the first electron microscope
- Konrad Zuse, built the first completely programmable floating point computer