Description of Work

Project acronym: BigFoot
Project full title: Big Data Analytics of Digital Footprints
Project budget: 3, Euro
Work programme topics addressed: Objective ICT: Cloud Computing, Internet of Services and Advanced Software Engineering
Name of the coordinating person: Pietro Michiardi
Fax:

List of Participants

Role   Number   Name                                        Short name   Country   Date enter   Date exit
CO     1        EURECOM                                     EUR          FR        1            36
CR     2        SYMANTEC                                    SYM          IR        1            36
CR     3        Technische Universität Berlin               TUB          DE        1            36
CR     4        Ecole Polytechnique Federale de Lausanne    EPFL         CH        1            36
CR     5        GridPocket                                  GRIDP        FR        1            36

Role: CO=Coordinator; CR=Contractor.

SEVENTH FRAMEWORK PROGRAMME, THEME FP7-ICT: Cloud Computing, Internet of Services and Advanced Software Engineering

Contents

1 Concept and objectives, progress beyond state-of-the-art, S/T methodology and work plan
   1.1 Concept and objectives
      Context
      Motivations
      Objectives: The BigFoot Approach
      Expected results
      Indicators and success criteria
      Relevance to the topics addressed in the call
   1.2 Progress beyond the state-of-the-art
      Application layer
         Parallel data processing
         Interactive query engines
         Distributed data stores
      Virtualization layer
      Relevant EU-funded projects
      Baseline
   S/T methodology and associated work plan
      Introduction
      Methodology
      Workplan Structure and Breakdown
      Overall System Description
      Usage Scenarios
      Risk and mitigation plans
      Work packages list
      Deliverables list
      List of milestones
2 Implementation
   Consortium as a whole
3 Impact
   Expected impacts listed in the work programme
      Strategic impact
      3.1.2 impacts listed in the work programme
      Scientific impact
      Social and economic impact
      The European dimension of BigFoot
   Plan for the use and dissemination of foreground
      Dissemination and communication strategy
      Exploitation strategies
      Standardization activities

1 Concept and objectives, progress beyond state-of-the-art, S/T methodology and work plan

1.1 Concept and objectives

The aim of BigFoot is to design, implement and evaluate a scalable system for processing and interacting with large volumes of data. The BigFoot software stack allows automatic and self-tuned deployment of data storage and parallel processing services for private cloud deployments, going beyond the best-effort services currently available in the state-of-the-art. The project addresses performance bottlenecks of current solutions and takes a cross-layer approach to system optimization, which is evaluated with a thorough experimental methodology using realistic workloads and datasets. The ultimate goal of the project is to contribute the BigFoot software stack to the open-source community.

Context

The amount of data in our world has been exploding. E-commerce, Internet security and financial applications, billing and customer services, to name a few examples, will continue to fuel exponential growth of large pools of data that can be captured, communicated, aggregated, stored, and analyzed. As companies and organizations go about their business and interact with individuals, they generate a tremendous amount of digital footprints, i.e., raw, unstructured data (for example, log files) created as a by-product of other activities. As discussed in the report in [56], there are many broadly applicable ways to leverage data and create value across sectors of the global economy:

- making data access and interaction simple;
- collecting and processing digital footprints to measure and understand the root causes of product performance and bring it to higher levels;
- leveraging large amounts of data to create highly specific user segmentations and to tailor products and services precisely to users' needs;
- producing sophisticated analytics to improve decision making with automated algorithms;
- using data analysis to create new products and services, enhance existing ones, and invent entirely new business models.

In summary, use of data is a key basis of competition and growth: companies failing to develop their analysis capabilities will fail to understand and leverage the big picture hidden in the data, and hence fall behind. Nowadays, the ability to store, aggregate, and combine large volumes of data and then use the results to perform deep analysis has become ever more accessible, as trends such as Moore's Law in computing, its equivalent in digital storage, and cloud computing continue to lower costs and other technology barriers. However, the means to extract insights from data require remarkable improvements, as software and systems to apply increasingly sophisticated mining techniques are still in their infancy.

Large-data problems require a distinct approach that sometimes runs counter to traditional models of computing. In BigFoot, we depart from high-performance computing applications and go beyond traditional techniques developed in the database community. In the remainder of this document we present a comprehensive system to process and interact with large amounts of data, which can be deployed on top of private, virtualized clusters of commodity hardware. Before delving into the technical objectives we address in BigFoot, in the next section we motivate our approach by focusing on a few use-cases and by clearly indicating the deficiencies of current approaches.

Motivations

We now illustrate the challenges organizations face when dealing with their own digital footprints. Although the following examples apply to a wide range of use-cases, we focus on the context and requirements of two companies that are part of the BigFoot consortium.

GridPocket: GridPocket provides energy-related value-added service solutions. The goal of this organization is to use and process consumption data generated by millions of customers to help electric, gas, and water utilities reduce their CO2 emissions. Data analysis tasks apply, for example, to the following cases: i) consumer billing: a monthly scan of all customer data to produce consumption reports; ii) consumer dashboard Web applications: analysis of the whole consumption data set to create personalized customer reports; iii) consumer segmentation: execution of sophisticated algorithms to classify consumers based on their consumption patterns and produce, for example, personalized contract offers; iv) provisioning applications: analysis of geographical consumption patterns and design of predictive algorithms to help operators provision their electric network.

Symantec: Symantec is one of the world's industry leaders in security software, focused on helping customers protect their infrastructures, their information, and their businesses. Through its Global Intelligence Network, Symantec has established some of the most comprehensive sources of Internet threat data in the world, with 240,000 sensors monitoring network attack activity in more than 200 countries through a combination of security products and managed services.

What is a private cloud? Private cloud deployments [44] resemble public ones: one or more datacenters (clusters of physical machines, interconnected via a high-speed local area network) host virtual server instances which can be customized from scratch, or which host services and applications exposed to end-users via simple and standard interfaces, e.g., HTTP. Such installations are private in that services and applications are not accessible from the outside world: they are confined to clients that interact with them from within the security perimeter of a company or organization. In addition, data stored and manipulated in a private cloud never leave the company's datacenter, hence policies and rules protecting data from unauthorized access can be enforced with legacy approaches to security and access control.

However, security analysts are challenged in their daily job of analyzing global Internet threats because of the sheer volumes of data Symantec collects around the globe [31]. In the cyber security domain, this is sometimes referred to as attack attribution and situational understanding, which are considered today as critical aspects to effectively deal with Internet attacks [43, 58, 70]. Attribution in cyberspace involves different methods and techniques which, when combined appropriately, can help to explain an attack phenomenon by (i) indicating the underlying root cause, and (ii) showing the modus operandi of the attackers. The goal is to help analysts answer important questions regarding the organization of cyber criminal activities by taking advantage of effective tools able to generate security intelligence about known or unknown threats.

What are the common requirements, with respect to analytics tasks, that characterize the two scenarios described above? Clearly, there are two main ways of interacting with data: i) batch processing of large amounts of data (for analysis, mining, classification, etc.), and ii) selective queries on subsets of the data, issued through manual inspection or by Web applications that read information related to a single user or a subset thereof. Hence the need for a unified system capable of supporting both kinds of interaction with data.

The requirements exemplified above can, to some extent, be met with existing technology. What are the typical solutions available today, and what novelty does BigFoot bring to the current state-of-the-art? We review the three most prominent approaches to large-data analytics: i) buying database management systems and appliances from big vendors, ii) using public cloud services, and iii) using open-source projects/products.

Data Analytics Appliances: examples of products that attack the Big Data market include EMC/GreenPlum, Splunk, Oracle Big Data Appliance and many more. Addressing large-data analytics problems with such an approach has the advantage of using a product that bundles together hardware and software and comes with production-level support. However, these products are closed-source, and little is known about their effective performance and how they behave compared to alternative solutions. Furthermore, the costs associated with a production-level deployment are exorbitant, with licensing fees proportional to the amount of data to be processed. It should also be noted that such systems are difficult (at best) to deploy and tune [62]. Last but not least, this approach suffers from the data lock-in problem: once data is loaded and analytic jobs are written for a specific platform, it is hard to move to another product.

Public-cloud Analytic Services: a prominent example of this kind of approach is Amazon Elastic MapReduce. The idea is, for organizations such as the ones discussed above, to ship their data to a public cloud storage service, prepare analytic jobs in an offline manner (e.g., a MapReduce program, but also SQL-like queries), submit the data processing code to a web application, and select the amount (and quality) of resources that will be dedicated to computation. Although this seems an appealing approach, it is not exempt from significant drawbacks. First, today's public cloud products offer only a best-effort service: in practice there is no performance guarantee, and current service-level agreements also account for down-time periods in which the service may be unavailable. Furthermore, current EU directives are strict in what concerns privacy issues, which drastically limits the applicability of public-cloud services.

Finally, it is often unacceptable for industries to put sensitive data sets on a public cloud infrastructure, for obvious confidentiality reasons (e.g., for a security software company, uploading to a public cloud the malware and attack data targeting its customers is not an option).

Open-source projects: as a prominent example we consider Apache Hadoop (we review other projects in Section 1.2), which consists of an ecosystem of sub-projects, each dealing with a particular technology (e.g., a parallel processing framework similar to Google's MapReduce [28], a distributed data store similar to Google's BigTable [17], etc.). Additionally, several commercial products based on Hadoop have emerged in the last two years: IBM BigInsights, MapR, and the Cloudera Hadoop Distribution. The main focus of Hadoop is to reach production-level quality: a lot of effort has been dedicated to addressing bugs, improving interoperability among components, improving custom deployments on dedicated clusters and, recently, eliminating single points of failure in the original architecture. In synthesis, open-source projects lack a structured effort toward system optimization and fail to cover the multiple layers involved in typical deployments.

In summary, although the current state-of-the-art offers a rich set of approaches to tackle large-scale data processing problems, we identify the following key points that are currently not addressed by existing technologies:

- Data interaction is hard. Current approaches lack an integrated interface to inspect and query (processed) data. Moreover, little work has been done in the literature to optimize the efficiency (and not only the performance) of interactive queries that operate on batch-processed data.
- Parallel algorithm design is hard. While the design of parallel algorithms is already a difficult topic per se, current systems make the implementation of even simple jobs a tedious and exhausting experience (see the sketch after this list). As such, parallel programs tend to have limited usability and a short life-time, i.e., code re-use is limited.
- Lack of optimizations. Current systems entrust users with the task of optimizing their queries and algorithms. Moreover, data-flow and storage mechanisms are data-processing oblivious, which leaves room for several optimizations that have not been addressed by current solutions.
- Deployment tools are poor. Management tools are still in their infancy and target solely bare-metal clusters. In addition, the effects of virtualization have been largely overlooked in the literature.
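To make the second point concrete, the sketch below shows the scaffolding a plain Hadoop MapReduce job requires for a task as simple as counting log records per customer. It is a minimal illustration written against the standard org.apache.hadoop.mapreduce API; the input layout (tab-separated text lines whose first field is a customer identifier) is an assumption made for this example, not a format defined by BigFoot.

// Minimal Hadoop MapReduce job: count log records per customer.
// Illustration only: the tab-separated input layout, with a customer
// identifier as the first field, is an assumed example format.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecordsPerCustomer {

    public static class ExtractCustomer
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text customer = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Emit (customerId, 1) for every log record.
            customer.set(line.toString().split("\t", 2)[0]);
            ctx.write(customer, ONE);
        }
    }

    public static class SumCounts
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text customer, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) {
                total += c.get();
            }
            ctx.write(customer, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "records-per-customer");
        job.setJarByClass(RecordsPerCustomer.class);
        job.setMapperClass(ExtractCustomer.class);
        job.setCombinerClass(SumCounts.class); // local pre-aggregation before the shuffle
        job.setReducerClass(SumCounts.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Even this trivial aggregation requires some fifty lines of typed plumbing, whereas the same computation in a declarative, SQL-like language is a one-liner (e.g., SELECT customer, COUNT(*) FROM logs GROUP BY customer). Closing this gap is precisely what the high-level language component introduced later in this section (Fundamental objective 2) targets.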

Illustrative example of a BigFoot deployment. We conclude this section with an illustrative example highlighting the benefits brought by the BigFoot project. For this example, we assume BigFoot is used by a company, say Symantec, willing to explore the secrets hidden in the vast amount of data it collects. We assume that Symantec is already in possession of a private cloud deployment, that is, the set of machines on which BigFoot will execute (in addition to existing, collocated services): BigFoot is physically deployed on their premises. What are the steps a Symantec user is required to follow? BigFoot exposes storage, data processing and querying components as a Platform-as-a-Service; in practice this translates into the following steps (a sketch of the interactive path follows the list):

- Using a standard interface (shell-based or web-based), the user specifies the location of the data she will operate on; the system injects such data into the relevant storage layer (distributed file system or data store).
- Using a standard interface (shell-based or web-based), the user specifies the data processing tasks to run on her data. Data processing can be delay-tolerant (that is, batch-oriented analysis) or latency-sensitive (that is, interactive queries), and the system can be instructed to direct such tasks to the corresponding engine.
- The system automatically deploys the necessary machinery (in terms of virtual machines) to execute data analysis tasks, and performs the necessary optimizations (data and virtual machine migration, data-flow enhancements, work-sharing optimizations) to obtain an aggregate result (that we also label metadata).
- Using a standard interface (shell-based or web-based), the user can further inspect aggregate statistics to extract useful information or to refine the data processing tasks.
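BigFoot does not prescribe a client API at this stage; as one concrete illustration of what the latency-sensitive, interactive path could look like behind such a standard interface, the sketch below issues a SQL-like query over JDBC to a Hive-style query server. The driver class and endpoint are those of Apache Hive's HiveServer2, used here only as a stand-in, and the consumption table (columns customer_id, kwh, billing_month) is invented for this example; none of these names are BigFoot components.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical interactive session: top-10 consumers for one month.
// Endpoint, table and column names are invented for illustration;
// the JDBC driver is Apache Hive's HiveServer2 driver, used as a stand-in.
public final class InteractiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn =
                 DriverManager.getConnection("jdbc:hive2://bigfoot.internal:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT customer_id, SUM(kwh) AS total "
                     + "FROM consumption WHERE billing_month = '2012-05' "
                     + "GROUP BY customer_id ORDER BY total DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}

The point of the sketch is the division of labor: the user writes only the query, while provisioning, data placement and routing to the appropriate engine (the third step above) happen transparently behind the interface.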

As the example above illustrates, BigFoot offers a unified setting to store, process and interact with data, exposed to the user through simple and standard interfaces to a cloud service. All the complications related to deployment, tuning and optimization are handled transparently by the system. In addition to the above scenario, it should be noted that BigFoot offers tap-in points for experienced users willing to sacrifice the simplicity of this approach for a more controlled usage of, for example, the parallel processing layer. This gives users the additional freedom to decide how to interact with and use BigFoot.

Objectives: The BigFoot Approach

The key challenge of BigFoot is to conduct cutting-edge research on several issues related to Big Data Analytics applications and services, producing relevant output for the research community with the potential of being immediately relevant and available to real-world, industrial problems. BigFoot has a number of important scientific and industrial objectives. These include fundamental (scientific and research-oriented) and experimental elements, complemented by contributions to the open-source community.

Fundamental Objectives. The fundamental objectives of BigFoot are threefold, and each is described in detail in the following sub-paragraphs. The technical work packages (WP) described in the remainder of this document, including WP 2, 3, 4 and 5, contribute to the research-oriented work carried out in BigFoot.

Fundamental objective 1: Given the potential offered by BigFoot, it is important to clearly define use cases and scenarios, and to detail the workloads that derive from real-world applications. Moreover, it is important to make an effort to generalize workloads to encompass other applications that share similar traits with those addressed in BigFoot. The work for this first objective is executed in WP2, and it is scheduled to take place in the first phase of the project.

Fundamental objective 2: In BigFoot, the focus is on system optimization, from the top layer down to the lower layers of the system stack. Each component of the architecture requires special care in achieving efficiency, scalability and reliability goals. Furthermore, little is known about combining optimization techniques at different layers of the stack, an approach we label cross-layer. As such, part of this fundamental objective is the design, implementation and validation of:

1. a novel component transforming a high-level, declarative query language into parallel programs (this work corresponds to a task in work package 3);
2. optimizations to the inner data-flow of the parallel processing framework adopted in BigFoot (this work is carried out in work package 3);
3. a service-oriented query engine improving data interaction, and its integration with distributed data stores (this work is done in work package 3);
4. novel data partitioning and placement mechanisms aiming at optimizing the storage layer (this work is carried out in work package 4).

Fundamental objective 3: The BigFoot system stack is designed to work in a virtualized cluster, consisting of a pool of virtual machines and virtual networks. Our ultimate goal is to go beyond best-effort services and offer performance guarantees. An underlying objective is to adopt a cross-layer approach to optimization. This fundamental objective is achieved by the work in WP5, which covers several aspects of infrastructure virtualization and algorithms to support the specification of requirements and constraints dictated by BigFoot components.

Experimental Objectives. BigFoot strives for cutting-edge experimental work driven by the applications envisioned in the project. To this end, each partner will run experiments on platforms deployed at their respective premises. Each experimental platform will evaluate selected BigFoot components during the early stages of the development process, and the whole platform once it is available. Experimental objectives are grouped by partner type, academic first, then industrial.

Experimental objective 1: EUR, TUB, and EPFL will work towards the design and deployment of experimental test beds to validate and analyze the performance of each BigFoot component, both individually (that is, before a consolidated implementation is available, as part of the work in WP3, 4 and 5) and as a whole, when the BigFoot software stack is fully available.

As part of the performance evaluation and benchmarking tasks of WP2, the academic partners will use their platforms to perform fine-grained measurements on all system layers, including individual computing node resources, the network equipment and topology, and the BigFoot stack. In the process, the partners will use auxiliary tools to generate synthetic data and workloads during the early stages of the project. Tests on the consolidated platform, in addition, will use real-world data (or anonymized versions thereof) and workloads.

Experimental objective 2: As industrial partners, SYM and GRIDP will work towards the design, implementation and evaluation of algorithms for scalable clustering, data mining and multi-criteria analysis of very large data sets by leveraging the underlying layers of the BigFoot system. As such, they will deploy local test beds to perform an experimental validation and performance assessment of the BigFoot software stack, which is part of Task 2.4 of WP2. Such experiments will follow the development progress of BigFoot; that is, they can take place even if a consolidated version of the system is not fully available: each individual component strives to optimize system performance and can be tested in isolation. The final system will be deployed by the industrial partners which, together with the academic ones, will analyze in detail the effects of a cross-layer approach to optimization. As a complement, the industrial partners will also consider alternative commercial approaches, so as to establish a comprehensive comparative analysis of the performance and efficiency of BigFoot with respect to the state-of-the-art.

Open-source Objectives. BigFoot goes beyond the fundamental elements outlined above and contributes the outcomes of the project to the open-source community. BigFoot takes two distinct directions toward this goal: contributions to existing open-source projects, such as Apache Hadoop, when these exist and are widely adopted by the industry; and establishment of new open-source projects for components that cannot be integrated into an existing one. We remark here that contributing to existing open-source projects should be regarded as a feature rather than a limitation of BigFoot: several other projects related to BigFoot (which we overview in Section 1.2) failed largely because they had little or no impact. BigFoot adopts the Apache Software Foundation (ASF) v2.0 license, which is compatible with the exploitation plans, especially those of the two industrial partners of the consortium. As discussed above, BigFoot strives to contribute to the existing open-source projects that emerged in the past few years as de-facto standards for Big Data applications: in particular, by focusing on the Hadoop project, BigFoot embraces (and is bound to) the ASF v2.0 license. For other open-source initiatives, the software license will be decided on a case-by-case basis, with a preference for the ASF v2.0 license. As part of the open-source objectives, BigFoot will create a number of Git repositories on a dedicated server for all software deliverables. The consortium will push frequent updates to the released software, including early releases, bug fixes, and a consolidation effort to bring software quality as close as possible to a production environment.

The following table summarizes the BigFoot objectives and indicates, for each category, a metric to assess the progress and accomplishments achieved in the project. A detailed description of these metrics and related indicators is given in the Indicators and success criteria section below.

Fundamental: The achievement of fundamental objectives is measured by the impact of articles published in prestigious international journals and conferences in the research community. Furthermore, achievements can be measured by the interaction with related research projects and by the establishment of training activities, including doctoral schools and workshops.

Experimental: These objectives can be measured by asserting the deployment of experimental platforms by project partners and by their relevance in carrying out the validation and performance evaluation tasks defined in WP2. Additionally, the adoption of BigFoot (or a subset of its components) for the Symantec WINE platform is another measurable metric to judge the achievements of BigFoot.

Open-source: Automatically generated activity reports and graphs (number of contributors, feature requests, bug fixes, frequency of contributions, etc.) are an integral part of the public BigFoot repositories and constitute a natural measurement metric to assess project progress. In addition, the creation of new JIRA tickets and the eventual adoption by the open-source community of contrib modules produced by BigFoot is another metric to monitor the progress and accomplishments of the project. Note that JIRA tickets are uniquely numbered and associated with the identity who created them; as such, it is possible to correlate such tickets with BigFoot activities.

Expected results

We now summarize the expected results of the BigFoot project. Once the products delivered by the project are clearly defined, we describe the potential users of the BigFoot stack. The results we expect from BigFoot fall into two main categories, research and software, described below.

Research results. BigFoot blends theory and systems research in the domain of large-scale data management and distributed systems. Given the composition of the consortium, which includes three academic partners, the main expected results of BigFoot materialize in research articles submitted to prestigious venues, including international conferences and journals, as described in detail in Sec. 3. The main domains in which we expect to have an impact include (but are not limited to): scalable algorithm design, scheduling protocols, work-sharing optimization, automatic physical design and dynamic data layout, system reliability, and embedding algorithms. In addition, the research activity underlying BigFoot involves a massive implementation effort toward the design and deployment of benchmarking tools and workload suites, cross-layer and work-sharing optimization techniques, and a thorough experimental approach to evaluating system performance and scalability.

Software results. In addition to research activities, BigFoot aims at having an impact on industry, which is ensured by the two industrial members of the consortium and by the presence of an industrial advisory board that BigFoot will rely on to understand and define strategic decisions during the project execution. BigFoot will contribute to existing open-source projects, including Apache Hadoop and OpenStack, and establish new projects following the two-pronged approach described in detail in the Open-source Objectives paragraph above. Essentially, prompt and incremental commits to experimental repositories represent the first phase of the open-source strategy. Subsequently, selected components will be pushed to existing open-source communities for review, contributions and final acceptance.

We now describe the potential users of the BigFoot software stack:

- Engineers, data scientists and product managers: the first category of users are employees of small and large companies that wish to exploit the potential of their data. Such users should not have to bother with the tedious exercise of dimensioning, deploying and tuning a complex distributed system to store and process data. Instead, they will use BigFoot as a Platform-as-a-Service solution requiring minimal manual intervention, allowing this user category to focus on their task: extracting useful information from data. For such users, BigFoot represents a drastic improvement over the state-of-the-art (the only competitor for BigFoot is Amazon Elastic MapReduce, which is a best-effort service).
- Developers: this category encompasses the open-source community that will contribute to BigFoot and its components. It should be noted that many large corporations (including Facebook, Twitter, IBM, Cloudera, and many more) currently embrace the open-source model of contributing to successful projects, and this is especially true for Hadoop. As such, we believe that BigFoot offers fundamental challenges to these users and, at the same time, will benefit greatly from their contributions.
- Researchers: the research activity carried out in BigFoot will produce a set of fundamental tools to understand production workloads based on real-world data and to assess the performance and scalability of the underlying systems used to store and analyze such data. Essentially, the experimental approach of BigFoot will enable this set of users to carry out their own research and eventually reproduce and compare their work to BigFoot.

Indicators and success criteria

The success of BigFoot can be measured based on the following indicators and success criteria, which are related to its objectives:

Fundamental Objectives: peer-reviewed articles in the highest-quality international and national scientific journals and conferences. We expect no less than one article per year per partner: a large fraction of the work in BigFoot blends theoretical and systems aspects of a variety of topics (system virtualization, parallel processing engines, distributed data stores and query engines, high-level languages and scalable algorithms) that may require an important implementation effort, so it is advisable to target a realistic publication rate that incorporates the engineering effort behind a scientific article.


More information

Unified Cyber Security Monitoring and Management Framework By Vijay Bharti Happiest Minds, Security Services Practice

Unified Cyber Security Monitoring and Management Framework By Vijay Bharti Happiest Minds, Security Services Practice Unified Cyber Security Monitoring and Management Framework By Vijay Bharti Happiest Minds, Security Services Practice Introduction There are numerous statistics published by security vendors, Government

More information

Navigating Big Data business analytics

Navigating Big Data business analytics mwd a d v i s o r s Navigating Big Data business analytics Helena Schwenk A special report prepared for Actuate May 2013 This report is the third in a series and focuses principally on explaining what

More information

VIEWPOINT. High Performance Analytics. Industry Context and Trends

VIEWPOINT. High Performance Analytics. Industry Context and Trends VIEWPOINT High Performance Analytics Industry Context and Trends In the digital age of social media and connected devices, enterprises have a plethora of data that they can mine, to discover hidden correlations

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Monitoring Best Practices for COMMERCE

Monitoring Best Practices for COMMERCE Monitoring Best Practices for COMMERCE OVERVIEW Providing the right level and depth of monitoring is key to ensuring the effective operation of IT systems. This is especially true for ecommerce systems

More information

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011 Online Content Optimization Using Hadoop Jyoti Ahuja Dec 20 2011 What do we do? Deliver right CONTENT to the right USER at the right TIME o Effectively and pro-actively learn from user interactions with

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances INSIGHT Oracle's All- Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA

More information

How Cisco IT Built Big Data Platform to Transform Data Management

How Cisco IT Built Big Data Platform to Transform Data Management Cisco IT Case Study August 2013 Big Data Analytics How Cisco IT Built Big Data Platform to Transform Data Management EXECUTIVE SUMMARY CHALLENGE Unlock the business value of large data sets, including

More information

Navigating the Big Data infrastructure layer Helena Schwenk

Navigating the Big Data infrastructure layer Helena Schwenk mwd a d v i s o r s Navigating the Big Data infrastructure layer Helena Schwenk A special report prepared for Actuate May 2013 This report is the second in a series of four and focuses principally on explaining

More information

Big Data Integration: A Buyer's Guide

Big Data Integration: A Buyer's Guide SEPTEMBER 2013 Buyer s Guide to Big Data Integration Sponsored by Contents Introduction 1 Challenges of Big Data Integration: New and Old 1 What You Need for Big Data Integration 3 Preferred Technology

More information