Research Statement Immanuel Trummer www.itrummer.org




We are collecting data at unprecedented rates. This data contains valuable insights, but we need complex analytics to extract them. My research focuses on optimization and planning problems that arise in the context of big data analytics and data science. I generally apply a broad portfolio of techniques to tackle those problems, ranging from approximation algorithms and massive parallelization to quantum computing. The latter research branch is based on a grant giving me access to a D-Wave adiabatic quantum annealer. Beyond optimization, I am interested in text mining and machine learning approaches that allow us to extract novel insights from large data sets. I recently completed a research project in that space in collaboration with several researchers from Google Mountain View. The primary goal of my research is to make big data analytics more efficient and to save users from daunting configuration and tuning tasks.

1 Dissertation Work

My dissertation focuses on generalizations of the classical query optimization problem that address the specific context of big data analytics. Those generalizations lead to excessively hard optimization problems. When I started working in this area, existing approaches required hours to solve a single problem instance. In my dissertation, I developed various methods that make those problems tractable. To achieve that, I also experimented with several novel techniques, namely massive parallelization and quantum computing, that had never before been used for query optimization or related problems. The goal of query optimization is to map a declarative query (describing the data to generate) into an optimal query plan (describing how to generate the data). Many recently released systems for big data analytics (e.g., Hive, Google's BigQuery, Facebook's Presto, or Pivotal's Greenplum) allow users to analyze data via declarative query languages.
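To make the size of this mapping problem concrete, here is a small illustrative sketch (my own illustration, not taken from the statement): even when we restrict attention to left-deep join orders, the number of alternative plans grows factorially with the number of joined tables, which is why naive enumeration quickly becomes infeasible.

```python
from itertools import permutations

def count_left_deep_orders(tables):
    """Each permutation of the tables is one left-deep join order.

    This deliberately ignores bushy plans and physical operator choices,
    which enlarge the search space even further.
    """
    return sum(1 for _ in permutations(tables))

print(count_left_deep_orders(["A", "B", "C", "D"]))  # 24 orders for 4 tables
print(count_left_deep_orders(list("ABCDEFGH")))      # 40320 orders for 8 tables
```

The count equals n! for n tables; real optimizers rely on dynamic programming or heuristics rather than enumeration.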
All those systems must solve the query optimization problem, and my dissertation work is therefore relevant to all of them. The query optimization problem has been intensively studied in my community. However, nearly all work on query optimization assumes that alternative query plans are compared according to a single cost metric. This model is appropriate for desktop analytics but not for big data analytics. Large data sets are often analyzed in the Cloud, using approximate processing techniques to reduce the computational burden. In this context, execution fees and result precision or recall become important cost metrics in addition to run time. Cloud providers, for instance, need to consider tradeoffs between the amount of system resources dedicated to a single user and the performance perceived by that user. Query optimization is therefore a multi-objective optimization problem in the context of big data analytics. Current query optimizers do not consider multiple cost metrics in a principled fashion. This forces users to explore many related options in a trial-and-error fashion. From my own experience with big data analytics, for instance during a recent internship at Google Mountain View, I can say that this is a painful process. During that internship, I had to tune the execution of a workflow processing large amounts of data that consisted mainly of binary join operations. I had to choose the join order, the amount of system resources dedicated to execution, and sample sizes for several input data sets. All those decisions depended on each other and led to different tradeoffs between execution time, resource consumption, and output quality. Having a query optimizer that is able to automatically find optimal tradeoffs between those cost metrics would have saved me a lot of time. It would also have led to better decisions, as I was certainly never able to find optimal cost tradeoffs myself.
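The tradeoffs a multi-objective optimizer must reason about can be modeled via Pareto dominance. The following sketch (plan names and cost vectors are hypothetical, purely for illustration) shows the core comparison: a plan dominates another if it is no worse on every metric and strictly better on at least one, and only the non-dominated plans represent optimal cost tradeoffs.

```python
def dominates(a, b):
    """True if cost vector a dominates cost vector b (lower is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_frontier(plans):
    """plans: dict mapping plan name -> (run_time_s, fee_dollars, error)."""
    return {name: cost for name, cost in plans.items()
            if not any(dominates(other, cost)
                       for o, other in plans.items() if o != name)}

plans = {
    "full-scan":   (120.0, 0.01, 0.00),  # slow, cheapest, exact
    "sampled-10%": (15.0,  0.02, 0.03),  # fast, cheap, approximate
    "indexed":     (20.0,  0.05, 0.00),  # fast, pricier, exact
    "bad-plan":    (130.0, 0.12, 0.05),  # worse than full-scan on every metric
}
print(sorted(pareto_frontier(plans)))  # → ['full-scan', 'indexed', 'sampled-10%']
```

The dominated plan is discarded; which of the remaining three is "best" depends on the user's preferred tradeoff, which is exactly what a single-metric optimizer cannot express.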
Building such an optimizer, however, requires efficient algorithms for multi-objective query optimization. Such algorithms did not exist prior to my dissertation. At the start of my dissertation, there was only one query optimization algorithm that would have been generic enough to consider all relevant cost metrics in big data analytics. This algorithm had not been experimentally evaluated prior to my PhD. I integrated that algorithm into the Postgres database system and tested it on standard queries. Even the optimization of relatively simple queries took several hours. This algorithm was therefore not suitable for use in a real system. During my dissertation, I explored various approaches to make query optimization with multiple cost metrics practical. I also explored novel approaches (e.g., massive parallelization and quantum computing) to make classical query optimization and related optimization problems more efficient. Figure 1 shows an overview of my work in that space. I discuss other research on large-scale text mining and machine learning at the end of this section. I developed approximation schemes for multi-objective query optimization that allow users to gradually relax optimality guarantees to decrease optimization time [2]. They represent a sweet spot between the aforementioned algorithm (which guarantees to find an optimal plan but has prohibitive optimization time) and pure heuristics (which can produce arbitrarily bad query plans in the worst case). It turns out that we can find guaranteed near-optimal plans in seconds where finding guaranteed optimal plans takes hours. Later, I proposed an incremental algorithm that is based on those approximation schemes [3]. The latter algorithm divides approximate optimization into many small incremental computation steps. Users can integrate their feedback after each step to guide the optimizer quickly towards interesting parts of the search space. The main goal of that algorithm is to enable interfaces that let users find their preferred execution cost tradeoff in an interactive process.

[Figure 1: Overview of my dissertation work with corresponding publication references. The figure groups the approaches into pre-computed solutions [4, 8, 9], interactive optimization [3], multi-objective approximation schemes [2], multi-objective randomized algorithms [7], probing on data samples [5], massively parallel optimization [11], quantum computing [10], and linear programming [6].]

Both of the aforementioned approaches make multi-objective optimization practical by reducing optimization time. Another possibility is to move optimization before run time. Optimization time may be high, but the time constraints for optimization are relaxed. Moving optimization before run time is possible if queries correspond to a query template that is known before run time. I have presented an algorithm that calculates all plans realizing optimal execution cost tradeoffs for each possible instantiation of a given query template [4, 8, 9]. This requires us to consider multiple parameters and multiple cost metrics during optimization. This problem model unifies and generalizes most previously proposed problem variants in query optimization. The publication in which I introduce the problem and propose a corresponding algorithm was selected as ACM SIGMOD Research Highlight 2015 and invited into the Best of VLDB 2015 special issue of the VLDB Journal. A video recording of a talk in which I describe the three aforementioned approaches in greater detail can be found online at https://www.youtube.com/watch?v=ez9fhvoj0ws. All approaches proposed so far work well for medium-sized queries. They are not sufficiently efficient to handle large queries with huge search spaces. To deal with such queries, I have introduced a randomized algorithm for multi-objective query optimization [7].
This algorithm exploits several specific properties of the query optimization problem and significantly outperforms general-purpose algorithms for multi-objective optimization. The drawback of this algorithm is the lack of worst-case guarantees on the quality of the generated plans. In order to treat large queries while maintaining formal quality guarantees, we can exploit a particularity of big data analytics platforms: they have to be massively parallel. If we use that parallelism for query execution, why shouldn't we use it for optimization as well? Parallel algorithms for query optimization had been proposed prior to my dissertation. Those prior algorithms are, however, only able to exploit very moderate degrees of parallelism. They employ a fine-grained problem decomposition method that requires parallel optimizer threads to share intermediate results. This leads to huge communication overhead when used in the shared-nothing architectures that are typical for large-scale analytics platforms. I proposed a radically different parallelization method that decomposes the search space in the coarsest possible way [11]. The search space is divided into a number of equal-sized partitions that corresponds to the number of optimizer threads. Those partitions can be searched independently without communication between different threads. I was able to show that this approach can parallelize query optimization over large clusters with hundreds of nodes. The decomposition method is not limited to multi-objective query optimization but can be used for classical query optimization and many other variants as well. Quantum computing can be seen as a different form of parallelization (this intuition is, of course, a strong simplification). Quantum computers leverage quantum physics to speed up computation. They operate on qubits instead of bits. Qubits can be in a superposition of states (1 and 0) that would be considered mutually exclusive according to the laws of classical physics.
Operating on qubits allows quantum computers to explore multiple computational paths in parallel. I obtained a research grant giving me access to the first commercially available machine that is claimed to exploit quantum physics to solve hard optimization problems: the D-Wave adiabatic quantum annealer. So far, I have used that machine to solve the multiple query optimization problem [10]. This is a query optimization variant whose goal is to merge query plans of different tenants into a globally optimal plan, taking into account possibilities to share computation between different tenants. This optimization problem is highly relevant to big data analytics, where multiple users often analyze the same centrally stored data set. My research led to the first experimental paper on quantum computing in the database community. The main contribution in that paper is a mapping algorithm that translates multiple query optimization instances into strength values for magnetic fields on and between the qubits of the quantum annealer. Using the quantum annealer is indeed challenging, and many research problems need to be solved before quantum computing becomes useful for analytics optimization. I discuss some of them in the next section. In my work on quantum computing, I developed a problem transformation that allows us to leverage a sophisticated hardware solver. In another work stream, I am currently developing approaches that enable us to leverage software solvers for query optimization. More precisely, I transform query optimization instances into mixed integer linear programming (MILP) problems. My first results show that such solvers can treat significantly larger search spaces than classical query optimization algorithms [6]. This is not surprising, since MILP solvers have steadily improved their performance (independently of hardware advances) over the past twenty years. By linking query optimization to MILP, we will benefit from all future advances in this highly fruitful research domain. Most of my work in query optimization addresses the challenge of having large search spaces and many cost metrics.
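The mapping algorithm itself is beyond a short sketch, but the shape of its target formulation can be illustrated. A quantum annealer minimizes a quadratic function of binary variables: each variable carries a linear weight (a field on one qubit) and each pair may carry a quadratic weight (a coupling between qubits). The toy instance below is hypothetical and not taken from [10]; a brute-force loop stands in for the annealing hardware.

```python
from itertools import product

def qubo_energy(x, linear, quadratic):
    """Energy of binary assignment x under linear and pairwise weights."""
    e = sum(w * x[i] for i, w in linear.items())
    e += sum(w * x[i] * x[j] for (i, j), w in quadratic.items())
    return e

def brute_force_minimum(n, linear, quadratic):
    """Exhaustive stand-in for the annealer: try all 2^n assignments."""
    return min((dict(enumerate(bits)) for bits in product([0, 1], repeat=n)),
               key=lambda x: qubo_energy(x, linear, quadratic))

# Toy instance: x0 and x1 select two plans that share work (negative
# coupling rewards choosing both); x2 is an alternative that conflicts
# with x0 (positive coupling penalizes choosing both).
linear = {0: -2.0, 1: -2.0, 2: -1.0}
quadratic = {(0, 1): -1.0, (0, 2): 3.0}
best = brute_force_minimum(3, linear, quadratic)
print(best)  # → {0: 1, 1: 1, 2: 0}
```

The hard part in practice is not this energy function but producing weights whose minimum encodes the optimization problem while respecting the annealer's limited qubit count and connectivity.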
My most recent (ongoing) work in this area addresses a complementary problem: the problem of missing information. Analytics queries nowadays often contain user-defined predicates that have to be treated as black boxes by the optimizer. Their properties can be estimated by evaluating them on data samples, but we need a formal framework to decide how much to sample and how to prioritize sampling. I recently introduced the probably approximately optimal query optimization framework [5] to guide such decisions. Evaluating predicates on data samples yields only confidence bounds on their properties. Confidence bounds on predicates translate into confidence bounds on the cost of alternative query plans. For that reason, I have argued that the goal of query optimization should be to find a plan whose cost is near-optimal with high probability. In ongoing work, I am exploring optimization approaches for that problem model. Beyond my work on analytics optimization, I have also developed new analytics applications. Together with researchers from Google Mountain View, I have developed a system that mines Web text to determine, for knowledge base entities, the subjective properties that the average user associates with them [1]. Knowing such associations enables us to answer Google queries containing subjective predicates (e.g., the query "big cities in California") from structured data. Mining subjective properties is challenging since we find many conflicting statements about the same entity. I developed an approach for learning user behavior models for specific entity types and properties in an unsupervised fashion. Those models are used to reliably infer the majority opinion from conflicting statements. I used that system to mine billions of subjective associations from terabytes of Web text. The mined associations match the opinions of a test user group in the majority of cases. This is not the case when using other state-of-the-art systems for mining objective properties.
This shows that subjective property mining requires specialized systems.

2 Future Work

One of my future research goals is to make data analysis more efficient. I intend to pursue two research directions that contribute towards that goal in different ways: First, I will study optimization approaches that allow us to make better use of current computer technology. Second, I plan to study the potential of an entirely new technology, namely quantum computing, for complex data analytics. Note that I have already taken the first steps in this direction, leading to the first experimental paper on quantum computing in the database community [10]. The first of those two research branches is targeted at immediate impact, while the second one lays the foundations for adopting a novel technology for certain analysis tasks in the long term. Beyond optimization, I plan to build novel systems for knowledge extraction. In particular, I plan to start new projects based on my prior research on information extraction at Google Mountain View. The amount of data keeps growing while the evolution of classical computers is slowing down. This makes it interesting to explore new computational paradigms. Quantum computing currently seems like a promising technology to complement conventional computers for certain tasks in the future. Major IT companies such as Google and IBM have invested heavily in this technology. With the D-Wave adiabatic quantum annealer, the first device that exploits quantum mechanics to solve optimization problems of non-trivial size has recently appeared. Quantum annealing is a fast-evolving technology (e.g., the number of qubits has so far steadily doubled from one annealer model to the next) that suffers, however, from various restrictions. Those restrictions make it challenging to use quantum annealing in practice. It is unclear if, how, for which problems, and within which time frame data analysis can benefit from quantum annealing and related technologies.
I plan to answer those questions in my future research. I generally see two ways in which data analysis can benefit from quantum annealing: either directly, by solving certain analysis tasks that can be expressed as optimization problems on the quantum annealer (e.g., problems related to the training of binary classifiers, a key step in many data science applications, have already been solved via quantum annealing), or indirectly, by solving optimization problems that optimize the use of conventional computers for data analysis. The work that I have done so far in this area (solving multiple query optimization on the quantum annealer) falls into the second category. In my future work, I will consider the first possibility as well. Note that my research goal is not to improve quantum computing hardware. My perspective on quantum computing is that of a user. My goal is to find out how to exploit quantum computing technology that is available today (or will be available in the foreseeable future) for problems that are relevant to data analysis and to the database community. There is a large body of work in the database community describing problem transformations by which existing software solvers can be exploited for database-related optimization problems. My focus is similar in that I plan to develop problem transformations that allow using a very specific hardware solver (the adiabatic quantum annealer). A first research goal would be to identify and to characterize analytics-related optimization problems that can benefit from quantum annealing in the long term. This is certainly not the case for all problems: for many optimization problems, the transformation into the restrictive input format supported by current quantum annealers causes excessive blowups in the problem representation size. Second, I plan to investigate decomposition methods that divide optimization problems into smaller sub-problems that can be represented with the available number of qubits.
I believe that such methods are required since the number of qubits is still very limited on current machines (the annealer I was experimenting with had around 1000 qubits). Furthermore, I plan to develop problem-specific mapping methods that efficiently map problem instances to the restrictive input format supported by the quantum annealer. Finding a problem mapping that consumes the minimal number of qubits is, in the general case, an NP-hard optimization problem. As I have shown in the case of multiple query optimization, it is, however, possible to find efficient, problem-specific transformation algorithms that achieve asymptotically optimal qubit counts. Finally, my experimental results for the multiple query optimization problem show that the relative performance of the quantum annealer compared to conventional computers varies significantly even between different instances of the same optimization problem. I therefore believe that the first systems to exploit quantum computing in the future will be hybrids that exploit a combination of conventional and quantum computing. Finding out how to combine those two computational paradigms in the best way (e.g., finding out how to decide for specific problem instances which one of them to use) is therefore another important research challenge in this domain. I plan to integrate some of my research in this space into the first prototypical implementation of a hybrid classical-quantum analytics engine that uses a combination of conventional computing and quantum computing to speed up data analysis. I believe that building such a system is required in order to answer many questions relating to the practical long-term potential of quantum annealing and related technologies for large-scale data analysis. Note finally that I currently have a grant giving me access to a quantum annealer. I might apply for additional grants in the future.
In any case, I do not expect my future university to provide me with access to such a machine. In principle, it would even be possible to gain insights using a simulated annealer instead of a real machine within the system I described before. Quantum computing has the potential to make complex data analysis more efficient in the long term. I believe that advanced optimization methods can help us to better exploit our current technology in the short and medium term. In my dissertation, I have experimented with several novel techniques to solve classical analytics-related optimization problems. In particular, I have shown that massive degrees of parallelism (which are typical for big data analytics platforms) can be exploited to make query optimization more efficient. In the future, I plan to use those techniques to address novel optimization problems that we have not dared to address so far. In particular, I plan to treat classical analytics-related optimization problems in more fine-grained search spaces than before. In query optimization, for instance, we would typically generate query plans that consist of standard operators. If we increase the resolution and consider the sub-functions or even the instructions that implement those operators, then we discover optimization potential that is not visible in the more coarse-grained perspective. This idea connects to approaches for synthesizing low-level code for single operators via cost-based optimization that have recently appeared in the database community. The optimization methods that are currently used require several minutes to synthesize a single operator on a standard computer. Generating entire query plans at run time will require fundamentally different optimization methods. I believe that some of the techniques that I used for classical query optimization could be helpful in that context.
Instead of increasing resolution, we might also zoom out and treat analytics-related optimization problems in a more holistic fashion than we do today. There are many optimization problems that we currently break into sub-problems to reduce optimization overhead (e.g., breaking query optimization into planning and scheduling decisions, or considering different tenants independently in a multi-tenant system). Breaking up optimization problems often means that we lose formal guarantees to find optimal solutions to the original problem. Having more powerful optimization methods, we can afford not to make some of those compromises anymore. Independently of the search space, the optimizer needs to estimate the cost of alternative processing plans to compare alternative solutions. This is increasingly difficult. Queries nowadays include user-defined functions and predicates that can be implemented by complex code, by calls to external services, or even by calls to human crowd workers. In many of those cases, it is impossible for the optimizer to estimate the properties of such operators via static analysis. We need to think of novel ways in which the optimizer can obtain information about them. Evaluating such operators on data samples is one possibility. Another possibility is to make optimization an interactive process in which the optimizer asks well-targeted questions to the user about specific operators. In each case, we need formal frameworks that prioritize information collection and weigh the cost of collecting additional information against the risk of choosing sub-optimal plans due to missing information. I have recently started to develop formal frameworks that enable us to model such scenarios. In the future, I plan to explore novel interaction models between user and optimizer and novel optimization methods that can deal with high degrees of uncertainty. In particular, I plan to study methods from the area of reinforcement learning. This area offers a rich set of approaches for prioritizing information collection that could be helpful for query optimization and related problems. Query optimization is just the tip of the iceberg. There are various optimization problems that relate to efficiency in data analytics. To name just a few examples, we need to decide where to store data, how to store data, and which auxiliary index structures to create. Similar to query optimization, many of those optimization problems need to be revisited in light of the specific challenges (e.g., having multiple cost metrics) and opportunities (e.g., having massive degrees of parallelism) that characterize the context of big data analytics. I plan to do so in my future research. Beyond my research on optimization, I am interested in developing novel systems for knowledge extraction.
For instance, I plan to start a new project based on the research that I did at Google Mountain View. We nowadays have large knowledge bases describing objective properties of various objects. As a result of my research at Google, we also have large knowledge bases containing subjective property associations. Combining both data sets would lead to interesting insights about the semantics of subjective properties. In principle, it is possible to infer the semantics of many subjective properties by correlating information about subjective properties with information about objective properties. For instance, by manually correlating the output of the workflow I developed at Google with objective properties, I was able to infer that cities in California are commonly called "big" starting from a population size of around 250,000. How to generate such insights automatically and reliably at a very large scale is an interesting research question. We would have to identify subjective properties that relate to objective properties and, for a given subjective property, we would have to identify the most relevant group of objective properties. A threshold is an example of a simple dependency, while more complex relationships are possible. By restricting the input corpus to Web content generated in specific regions, we could infer localized semantics. This might lead to interesting insights about cultural differences. Having translated subjective properties into conditions on objective properties, we will also be able to associate subjective properties with entities based on their objective properties alone. I believe that this approach yields subjective associations of a higher quality than when mining them directly from Web text.

References

[1] I. Trummer, A. Halevy, H. Lee, S. Sarawagi, and R. Gupta. Mining subjective properties on the Web. In SIGMOD, pages 1745-1760, 2015. Talk recording: https://www.youtube.com/watch?v=a9rybydqrxa
[2] I. Trummer and C. Koch. Approximation schemes for many-objective query optimization. In SIGMOD, pages 1299-1310, 2014.
[3] I. Trummer and C. Koch. An incremental anytime algorithm for multi-objective query optimization. In SIGMOD, pages 1941-1953, 2015. Talk recording: https://www.youtube.com/watch?v=j54gvit9uao
[4] I. Trummer and C. Koch. Multi-objective parametric query optimization. VLDB, 8(3):221-232, 2015. Talk recording: https://www.youtube.com/watch?v=ho3iasfftjy
[5] I. Trummer and C. Koch. Probably approximately optimal query optimization. 2015. URL: http://arxiv.org/pdf/1511.01782v1.pdf
[6] I. Trummer and C. Koch. Solving the join ordering problem via mixed integer linear programming. 2015. URL: http://arxiv.org/pdf/1511.02071v1.pdf
[7] I. Trummer and C. Koch. A fast randomized algorithm for multi-objective query optimization. In SIGMOD, 2016.
[8] I. Trummer and C. Koch. Multi-objective parametric query optimization. ACM SIGMOD Research Highlights, 2016. Currently invited.
[9] I. Trummer and C. Koch. Multi-objective parametric query optimization. VLDB Journal, 2016. Currently invited.
[10] I. Trummer and C. Koch. Multiple query optimization on the D-Wave 2X adiabatic quantum computer. In VLDB, 2016. Conditionally accepted. Preprint: http://arxiv.org/pdf/1510.06437v1.pdf
[11] I. Trummer and C. Koch. Parallelizing query optimization on shared-nothing architectures. In VLDB, 2016. Conditionally accepted. Preprint: http://arxiv.org/pdf/1511.01768v1.pdf