BigFoot: Big Data Analytics of Digital Footprints

Project name: BigFoot
Project ID: FP7-ICT, Call 8
Working Package Number: WP2
Deliverable Number: D.2.1
Document title: Current practices of Big Data analytics
Document version: 1.0
Author: ALL
Date: 5 April 2013
Status: Public

Revision History

Date    Version  Description                          Author
05/02            Initial Deliverable Setup            Yun Shen, Olivier Thonnard
25/03            Version ready for internal reviews   BigFoot Team
03/04            Final version                        BigFoot Team
04/04            Final review                         BigFoot Team

Executive Summary

This document is released after the first six months of the BigFoot project, in the middle of the first phase: design and specification. We present current practices and research work related to our work within the BigFoot project and to the use cases that we examined and documented in deliverable D.2.1. Virtualization is not discussed in this document, as the related work in that area is described in deliverable D.5.1.

The purpose of this document is to illustrate current practices and related work, both in the areas where we contribute new research and in the areas that interface with BigFoot (i.e., how to feed data to BigFoot, and how to use the results that BigFoot will provide). The structure of the document follows the flow of information: data collection in Section 2 (how to get data), data management in Section 3 (how to store data), computational frameworks in Section 4 (tools to handle and transform data), and analysis and applications in Section 5 (what to do with data). In Section 6, we conclude with a short review of existing integrated data analytics products.

With respect to data collection (Section 2), we use existing technology to input data into our systems: while the cyber security use case adopts standard techniques, the smart grid data collection case has some particularities that are reviewed in this document.

Data management is a first key area where BigFoot will bring its contributions. In Section 3, we review storage strategies for data warehousing, focusing on the differences between row- and column-oriented storage, and we introduce the adaptive caching strategy used by NoDB for querying raw data. In addition, we review existing technology for distributed data storage, for databases in the SQL and NoSQL ("not-only SQL") fields, and for distributed file systems.

BigFoot targets two main types of computational frameworks (Section 4): batch processing and interactive queries. With respect to batch processing for big data, Apache Hadoop is quickly becoming a de facto standard, with a large set of tools, libraries, high-level languages, and academic research efforts behind it; another notable project is Spark, which enables in-memory data analysis using the same MapReduce programming model. With respect to interactive query processing, the panorama is more fragmented, with Google Dremel being the inspiration for two important open source projects: Apache Drill and Impala. For completeness, we also describe stream processing systems such as S4 and Storm.

The computational frameworks described above are designed to perform a variety of tasks over data: in Section 5, we review those that are of most interest to BigFoot.

We describe scalable machine learning techniques with a bias towards clustering, as it is an application that is central to our use cases. The output of analysis techniques such as clustering generally needs tools to be examined: we also provide an overview of techniques and tools for visualization. We conclude the section with a more focused overview of the big data analytics applications that are used for the smart grid and for cyber security. In Section 6, we conclude this document by presenting existing data analytics projects that BigFoot compares to.

Contents

1 Introduction
  1.1 Cyber Security
  1.2 Smart Grids
2 Data Collection and Integration
  2.1 Current Practices in Data Collection and Integration
  2.2 SmartGrid Data Collection
3 Data Management
  3.1 Data Warehousing
    3.1.1 Row-oriented Data Warehousing
    3.1.2 Column-oriented Data Warehousing
    3.1.3 NoDB
  3.2 Distributed Data Stores
    3.2.1 NoSQL Data Stores
    3.2.2 SQL (Relational) Data Stores
  3.3 Distributed File Systems
4 Big Data: Computational Frameworks
  4.1 Batch Processing
    4.1.1 MapReduce
    4.1.2 Libraries and Query Languages
    4.1.3 Optimizing Batch Processes
    4.1.4 Special-Purpose Frameworks
  4.2 Interactive Query Processing
  4.3 Stream Processing
5 Big Data: Analysis and Applications
  5.1 Scalable Machine Learning and Data Mining
    5.1.1 Machine Learning
    5.1.2 Clustering
      K-means and Lloyd's algorithm
      Scalable Clustering
      Other Clustering Research Trends
      Implementation Efforts
  5.2 Big Data Visualisation
    5.2.1 Basic Techniques
    5.2.2 More Advanced Techniques
    5.2.3 Scalable Visualization Techniques
    5.2.4 Open Source Tools for Big Data Visualization
    5.2.5 Commercial Tools for Big Data Visualization
  5.3 Big Data Analytics in SmartGrid
  5.4 Big Data Analytics in Security
6 Open Source Distributions and Commercial Products
  6.1 Open Source Distributions
    6.1.1 CDH
    6.1.2 Hortonworks
  6.2 Commercial Products
    Greenplum
    Splunk
    Revolution Analytics
    Terracotta
    Teradata
    SAP HANA
    Oracle Big Data Solution
    SAS
    Netezza
    HP Vertica
7 Conclusion

1 Introduction

Big data analytics has been a hot research topic in recent years. Big data refers to collections of very large data sets that satisfy the 3Vs definition: high volume (data size), high velocity (data growth rate), and high variety (data types and sources). Examples of such data sets include public sector data, complex physical simulations, smart grid data, sensor network data, social network data, security data, stock exchange data, etc. These data sets are so complex and large, with billions of records and thousands of features, that it becomes impractical to process and analyze them with traditional approaches. For example, even the simplest operations, such as averaging a series of values, counting the frequency of variables, or joining two data sets on a common feature, require an enormous amount of time to accomplish, not to mention routine data mining tasks such as clustering and classification. Nevertheless, these data sets contain rich information and can offer a deeper understanding of a broad range of topics such as user behavior profiling, event detection, market segmentation, trend prediction, etc. In this document, we review the current practices of big data analytics in two particular domains of major interest: cyber security and Smart Grids.

1.1 Cyber Security

Nowadays, security analysts are challenged in their daily job of analyzing global Internet threats because of the sheer volumes of data collected around the globe [75]. In the cyber security domain, this is sometimes referred to as attack attribution and situational understanding, which are considered today as critical aspects to effectively deal with Internet attacks [125, 172, 233]. Attribution in cyberspace involves different methods and techniques which, when combined appropriately, can help to explain an attack phenomenon by (i) indicating the underlying root cause, and (ii) showing the modus operandi of attackers. The goal is to help analysts answer important questions regarding the organization of cyber criminal activities by taking advantage of effective tools able to generate security intelligence about known or unknown threats.

Recently, security companies have realized that massive amounts of data can improve their understanding of Internet attacks. For example, security and threat analysis services collect hundreds of thousands of seemingly unique malware samples per week [224], and monitor attack activity coming from millions of sensors distributed worldwide [223].

This requires highly scalable analysis tools to classify, correlate and prioritize security events, depending on their likely impact, threat level and possibly many other criteria. Additionally, it is now widely recognized by world-renowned security experts that Cyber Situational Awareness (or "Cyber-SA") is becoming a crucial research area that must be developed to cope with the growing number of threats, but also with their degree of sophistication [125, 259, 62]. Dozens of Advanced Persistent Threats (or APT 1) involving major corporations have been reported in the news in the past 18 months. Recent reports by the Ponemon Institute [232] and the Security for Business Innovation Council also indicate an escalation of those threats [233].

1 APT: a cyber attack that is highly targeted, thoroughly researched, well-funded, and tailored to a particular organization, and which employs multiple vectors to compromise enterprise networks.

The use of scalable algorithms and systems to process and interact with large amounts of data in cyber security applications such as the ones described above is still in its infancy. For example, implementations of even simple clustering algorithms that are largely used in many fields are inefficient and do not make appropriate use of the underlying system resources. Moreover, recent works have shown that existing analysis methods are ineffective at addressing these challenges [234, 237]. Large-scale attack phenomena (e.g., malware families propagating in the Internet [152, 153], botnet activities [64, 236], spam campaigns [235], or rogue AV software campaigns [59]) are often largely distributed and their lifetime can vary from a few days to several months. They typically involve a considerable number of features interacting in non-obvious ways, which makes them inherently complex to identify [29, 234]. To meet this challenge, cyber security needs to advance technologically to pro-actively analyze enormous amounts of security data. This requires a paradigm shift from rules, signatures and firewalls to automatic threat attribution on top of big data analytics.

1.2 Smart Grids

There is no universally accepted definition of the Smart Grid, but according to the European Technology Platform Smart Grid (ETPSG), the smart grid is "an electricity network that can intelligently integrate the actions of all users connected to it (generators, consumers and those that do both) in order to efficiently deliver sustainable, economic and secure electricity supplies". Some tangible benefits of the Smart Grid should include:

- Meter reading cost reduction and accuracy increase
- Billing improvement and reduction in the number of complaints on meter reading
- Collection time and rate improvement
- The possibility to remotely connect or disconnect a meter
- Complex tariff system implementation
- Network planning and operation improvement
- Energy demand management
- Maintenance cost reduction and reliability improvement

Figure 1: Architecture of a Smart Grid (Source: GTM Research)

To be able to enhance the reliability, efficiency, and security of current power grids, and also to reduce their unwanted environmental impacts, a smart grid should possess some additional elements and capabilities (see Figure 1):

1. Smart meters, giving the possibility to precisely measure and record at regular intervals the energy usage at customer premises and at intermediate points in the grid, such as transformers and substations. Smart meters also enable two-way communications, that is, the ability to both read and send information to the meters.

2. An Advanced Metering Infrastructure (AMI), enabling the collection and distribution of information between customers and other entities.

3. A Meter Data Management (MDM) system, for storing and analyzing the data from the AMI and for providing subsets of this data to customers and other computer systems.

4. Secure communication networks, interconnecting the different domains of the smart grid, i.e., electricity generation, transmission, distribution, customers, markets and, finally, service providers such as GridPocket.

It is expected that the different elements of Smart Grids will generate massive amounts of data, structured and unstructured, that will need to be stored, integrated, retrieved and analyzed, and for which original high-performance computer systems [191, 221] and algorithms [168, 20] have to be designed and evaluated.

The rest of this document is organized as follows. Section 2 covers the current practices in data collection and integration. In particular, Section 2.2 reviews the current practices and standards for exchanging and collecting SmartGrid data. Section 3 then covers state-of-the-art data management techniques in the Big Data context. First, it reviews two major storage layouts, row-oriented and column-oriented (e.g., MonetDB, C-Store) data warehousing, in Section 3.1. NoDB, a new paradigm in database systems, is reviewed in Section 3.1.3. Section 3.2 overviews some of the most relevant distributed data stores. Furthermore, different distributed file systems, GoogleFS and HDFS, are also discussed in Section 3.3.

The project covers a batch analytics engine meant to perform bulk data analysis, and an interactive query engine for selective, low-latency queries. Section 4 reviews related work for the two engines. Section 4.1 reviews MapReduce engines (e.g., Hadoop, Spark), high-level languages deployed on top of MapReduce, optimization efforts and some alternative programming models. Current practices in interactive query processing, such as Google Dremel and Apache Drill, are discussed in Section 4.2.

Another important part of the BigFoot project is analytics and applications. Section 5 systematically reviews scalable machine learning and data mining techniques (Section 5.1), with a focus on clustering algorithms. Various techniques and tools that provide visualizations to assist analysts with a variety of ways to interact with big data are reviewed in Section 5.2.

Advanced techniques specific to SmartGrid and cyber security are individually covered in Section 5.3 and Section 5.4. Open source distributions and commercial products for big data analytics are discussed and reviewed in Section 6. The last section summarizes the findings of this survey and briefly outlines its implications and future developments for the project.

2 Data Collection and Integration

Big data is usually collected from various data sources: data warehouses, the Web, networked machines, virtual machines, sensors over the network, legacy applications and clusters, etc. The first challenge is integrating multiple data sources in an automated, scalable way to fuse and store these raw, heterogeneous data. In this section, current practices in data collection and integration are discussed. We especially review the current practices and standards in SmartGrid data collection due to its unique complexity.

2.1 Current Practices in Data Collection and Integration

A common use case for batch engines is processing logs from a large number of machines. This can pose problems, as the data typically resides in a very large number of small files and needs to be collected and compacted into fewer large files in order to facilitate fast handling by processing engines such as Apache Hadoop. Apache Flume and Apache Chukwa [195] are systems that automate this task, saving data to distributed data stores such as HDFS and HBase, and providing tunable choices in the trade-off between latency, reliability and throughput.

In many other cases, data is transferred between big data processing engines and relational databases. Apache Sqoop is a tool that allows transferring data, in a scalable way, between databases and the Hadoop Distributed File System; this is done using Hadoop to orchestrate the data transfer.

2.2 SmartGrid Data Collection

Smart Grids will generate high volumes of complex data that, once integrated with other key information collected from outside the grid, will have strong potential for developing new applications and services. Examples of data of interest for Smart Grid applications are:

- Electrical quantities, such as power consumption, power quality or voltage.
- Meteorological parameters, such as temperature, humidity, cloud cover or wind.
- Indoor parameters, such as temperature, ambient noise or brightness.
- Events / signals / alerts possibly coming from various entities in the grid.

- Information about the buildings, such as the date of construction or the square footage, or about the customers, such as electricity rate plans or socio-demographics.
- Data from Geographical Information Systems.
- Legal information, for example related to the privacy of the data and their terms of use.
- Electricity prices from the market or from forecasting systems.
- Rate plans from utility companies.
- Behavioural data collected by Customer Relationship Management systems, such as web/mobile interface logs or questionnaires.

These data appear to be extremely diverse in their nature, their format, their frequency, their amount, their scale, their provenance, their privacy and so on. For example, in the case of electricity-related data, numerous electrical quantities are likely to be informative, and these quantities can be either measured or predicted. Moreover, they can correspond to multiple scales, e.g., per device, per household or per geographic area. They can be sampled at multiple frequencies, varying from one register read per second to one every hour. They can be collected in near real-time mode as data streams or dropped in a repository in batch mode. In terms of volume, assuming a data payload of about 80 bytes per meter reading, a national-scale utility company such as the French Électricité de France anticipates that a total of around 120 terabytes of uncompressed data should be collected every year [127].

Clearly, in the European liberalized electricity market, the Smart Grid landscape will be very intricate, with many separate entities, each having their own roles and goals but also potentially conflicting interests. These entities will be generating their own sets of data and it is still unclear how they will exchange them. According to Roberta Bigliani, head of IDC Energy Insights for Europe, commenting on Smart Grid data: "We could be in a situation where we are creating silos [n.b. data in a silo remains sealed off from the rest of the organization] of data rather than making more consistent availability of the data; data needs to be validated and translated into a meta-data model, to create something that is usable by multiple applications. IT people need to work with the line of business to define a master data sort of approach and try to create a layer where all the data coming from meters or operational systems are transformed into pieces of data that different applications can call."

Building systems capable of collecting and integrating all these data is one of the main challenges of the Smart Grid. Recently, some standards have started to emerge. Below we discuss two of them: ETSI-M2M for exchanging data streams and openespi for batches of data.

ETSI-M2M

The European Telecommunications Standards Institute (ETSI) produces globally applicable standards for Information and Communications Technologies. In 2010, ETSI established a new machine-to-machine (M2M) technical committee to address the issues of M2M communication with smart meters [164], such as how to obtain meter reading data, install, configure and maintain the smart metering information system, support prepayment functionality, monitor power quality data, or manage outage data. All these use cases are discussed by the committee in terms of general description, stakeholders, scenario, information exchanges, and potential new requirements. The goal is to establish a standard for an M2M architecture that can be used for the exchange of smart meter data and events between machines, involving communications across networks without requiring human intervention, as shown in Figure 2.

Figure 2: Typical M2M Smart Metering infrastructure (Source: ETSI [164])

GridPocket is a forerunner in the adoption of the ETSI-M2M standard in the context of smart meters and has already developed a working implementation. As part of the GridTeams project, this implementation was used to collect streams of active power and energy, together with indoor temperature measurements, at a rate of one read per second.

This project has involved 30 households in the Cannes area (France), for a period of one year starting in January.

openespi

The Energy Services Provider Interface (ESPI) is a standardized process and interface for the exchange of a retail customer's energy usage information between utility companies and authorized third-party service providers. The standard provides model business practices, use cases, models and an XML schema that describe the mechanisms by which the exchange of energy usage information may be enabled. An example of meter data in openespi (an open source implementation of the ESPI standard) XML format is presented in Figure 3.

Figure 3: Example of meter data in openespi XML format displayed in a Web browser using the appropriate XSLT file (Source: GreenButton [2])
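Since the XML content of Figure 3 is not reproduced in this text-only version, the sketch below shows how such a feed could be consumed programmatically. It is an assumption-laden illustration: the element names (IntervalReading, value) and the namespace URI follow the publicly documented ESPI/Green Button schema, and the input file name is hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch: extract interval readings from an ESPI/Green Button XML export.
// The namespace and element names are assumptions based on the public ESPI
// schema; the file name is hypothetical.
public class EspiReader {
    private static final String ESPI_NS = "http://naesb.org/espi";

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse("meter-data.xml");

        NodeList readings = doc.getElementsByTagNameNS(ESPI_NS, "IntervalReading");
        for (int i = 0; i < readings.getLength(); i++) {
            Element reading = (Element) readings.item(i);
            NodeList values = reading.getElementsByTagNameNS(ESPI_NS, "value");
            if (values.getLength() > 0) {
                System.out.println("reading " + i + ": " + values.item(0).getTextContent());
            }
        }
    }
}
```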

3 Data Management

In this section we overview state-of-the-art techniques for data management and warehousing relevant in the context of Big Data processing. We first overview data warehousing techniques in Section 3.1, followed by an overview of NoSQL and SQL distributed data stores in Section 3.2 and of distributed file systems for Big Data in Section 3.3.

3.1 Data Warehousing

The way data is stored in a database management system (DBMS) determines the way we evaluate queries. Both the storage and the access methods define the possible optimizations we can think of and implement over an existing architecture. Two major storage layouts for DBMSs have been proposed, namely the row-oriented and the column-oriented architecture. We also discuss a novel approach in which queries are performed directly on raw data. In this section we discuss the architectural and design principles of data warehousing; an overview of commercially available solutions is given in Section 6.

3.1.1 Row-oriented Data Warehousing

In a row-oriented database system we store and process data one tuple at a time, i.e., one row of a table at a time. The storage layout in a row-store is typically based on pages. Each page holds a certain amount of data, e.g., 8 KB, and contains a number of rows from a given table. Given the variety of possible data types in a table, we need to maintain a number of metadata entries for each page, for each row in a page and for each attribute in a row. For example, we need to know data sizes, the starting positions of attributes and rows, etc. All this is necessary so that we can navigate through a page. This storage layout implies that we need to read from disk (or from memory) all the table's attributes, even if our query requires only a subset of them.

The processing model in a row-store is typically based on the Volcano model [97], i.e., the query plan processes one tuple at a time. Each tuple goes all the way through every operator in the plan before we move on to the next tuple. During query processing, we continuously need to move on to the next attribute of a row, which means we need to know how many bytes to read from the row and from which exact position. We also need to call the next operator in the plan and be aware of the data type, which can of course differ from that of the previous attribute.

Then, once we are done with one row, this has to happen again with the next row, and so on. Row-store technology has been the basis for all major commercial products. It prevailed in the early years for good reasons: e.g., organizing data in the form of tuples makes it easy to load, update and process all relevant data for a given database entry. This kind of processing was most typical when databases were mostly used for online transaction processing (OLTP).

3.1.2 Column-oriented Data Warehousing

Over the years, application needs have changed. In addition to OLTP, applications now critically need the ability to handle analytical queries for online analytical processing (OLAP). These kinds of queries do not always need to process full tuples. Instead, they focus on analyzing a subset of a table's attributes, e.g., running various aggregations to understand and analyze the data. For this kind of application a column-store architecture seems more natural, which led to the design of a number of very interesting systems, e.g., MonetDB [175], MonetDB/X100 [39] and C-Store [219]. These systems were originally inspired by the Decomposition Storage Model (DSM) [54].

Column-oriented DBMSs store data one column at a time as opposed to one tuple at a time. This brings an obvious I/O benefit for queries that require only part of a table's attributes: we only load the attributes (i.e., columns) that are relevant to our query, instead of loading needless data. However, a column-store is much more than simply storing data one column at a time; it offers a wide range of opportunities for further optimizations. Column-specific compression techniques [4, 268] achieve significant compression levels. For instance, dictionary compression in a row-store would typically happen at the page level, i.e., creating one dictionary for each page. However, the opportunities for compression are restricted, since a tuple format typically mixes diverse data types and unrelated values. Conversely, in a column-store a page contains only values of a single attribute, increasing the chances of finding good compression cases. Another optimization in the column-store architecture is the bulk processing model: we typically process one column at a time, e.g., in MonetDB [175], or one chunk at a time [243]. This approach allows the query engine to exploit CPU- and cache-optimized, vector-like operator implementations throughout the whole query evaluation, minimizing function calls, type casting, various metadata handling overheads, etc.
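To make the dictionary compression argument concrete, here is a small illustrative sketch (not taken from any particular system) of encoding a single string column as integer codes plus a dictionary; repeated values then cost one small integer each instead of a full string.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative dictionary encoding of one column: each distinct string gets a
// small integer code, and the column is stored as the code array plus the
// dictionary. This is a toy sketch, not how any specific engine implements it.
public class DictionaryEncodedColumn {
    private final Map<String, Integer> dictionary = new HashMap<>();
    private final List<String> reverse = new ArrayList<>();
    private final List<Integer> codes = new ArrayList<>();

    public void append(String value) {
        Integer code = dictionary.get(value);
        if (code == null) {
            code = reverse.size();
            dictionary.put(value, code);
            reverse.add(value);
        }
        codes.add(code);
    }

    public String get(int row) {
        return reverse.get(codes.get(row));
    }

    public static void main(String[] args) {
        DictionaryEncodedColumn country = new DictionaryEncodedColumn();
        for (String v : new String[] {"FR", "DE", "FR", "FR", "IT"}) {
            country.append(v);
        }
        // Five stored values, but only three dictionary entries.
        System.out.println(country.get(2)); // prints FR
    }
}
```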

In general, row-store architectures are most appropriate when the database is mostly used for online transaction processing (OLTP), where we expect a large number of short online transactions. On the other hand, column-store architectures are most appropriate for applications that handle analytical queries for online analytical processing (OLAP), where we expect a relatively low volume of transactions, while queries are often very complex, involve aggregations, and usually focus on a subset of a table's attributes.

MonetDB

MonetDB has pioneered column-store solutions since 1993 and constitutes the first full-fledged open-source column-oriented DBMS. In MonetDB, every n-ary relational table is represented as a collection of Binary Association Tables, called BATs [38]. A BAT represents a mapping from an oid key to a single attribute attr. Its tuples are stored physically adjacent to speed up traversal, i.e., there are no holes in the data structure. For a relation R of k attributes, there exist k BATs, each BAT storing the respective attribute as (key, attr) pairs. The system-generated key identifies the relational tuple that the attribute value attr belongs to, i.e., all attribute values of a single tuple are assigned the same key. For base tables, the keys form a dense ascending sequence, enabling highly efficient positional lookups. Thus, for base BATs, the key column is a virtual, non-materialized column.

For each relational tuple t of R, all attributes of t are stored in the same position in their respective column representations. The position is determined by the insertion order of the tuples. This tuple-order alignment across all base columns allows the column-oriented system to perform tuple reconstructions efficiently in the presence of tuple-order-preserving operators. Basically, the task boils down to a simple merge-like sequential scan over two BATs, resulting in low data access costs through all levels of modern hierarchical memory systems.

MonetDB is a late tuple reconstruction column-store: when a query is fired, the relevant columns are loaded from disk to memory but are glued together into an n-ary tuple format only just before producing the final result. Intermediate results are also materialized as temporary BATs in a column format. Intermediate results can be efficiently reused by recycling pieces of (intermediate) data that are useful for multiple queries [122] or in sliding window-based processing [155].

C-Store

C-Store is a column-oriented database system whose main architectural novelty is that each column/attribute is sorted and this order is propagated to the rest of the columns. Multiple projections of the same relation can be maintained, up to one per attribute.

The sorted order is exploited for fast selections, while the alignment across the columns of each projection is exploited for fast tuple reconstruction. To handle the extra storage space required by the multiple projections, compression is extensively used [4].

3.1.3 NoDB

As data collections become larger and larger, data loading becomes a major bottleneck. Many applications, like scientific data analysis and social networks, already avoid using database systems due to their complexity and the high data-to-query time. For such applications data collections keep growing fast, generating too much data to move or store, let alone analyze. Alagiannis et al. [10] design the road-map of a new paradigm in database systems, called NoDB, which does not require traditional data loading while still maintaining the whole feature set of a modern database system. In particular, NoDB makes raw data files a first-class citizen, fully integrated with the query engine. In situ query processing, however, creates new bottlenecks, namely the repeated parsing and tokenizing overhead and the expensive data type conversion costs, which are addressed by the NoDB prototype system [10, 11]. The techniques used to address these involve an adaptive indexing mechanism that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure. The prototype implementation over PostgreSQL, called PostgresRaw, is able to avoid the loading cost completely, while matching the query performance of plain PostgreSQL and even outperforming it in many cases.

3.2 Distributed Data Stores

Here we overview some of the most relevant distributed data stores applicable in the Big Data context, covering both NoSQL and SQL (relational) data stores. Our focus is on the currently most popular solutions; more specialized and comprehensive reviews of scalable NoSQL and SQL distributed data stores can be found in, e.g., [44, 220].

3.2.1 NoSQL Data Stores

Driven by the scalability requirements of Big Data, NoSQL ("not-only SQL") data stores have recently emerged as alternatives to classical relational DBMSs. Namely, NoSQL data stores rely on horizontal scaling, which partitions a data store across several machines to cope with massive amounts of data.

A specific architecture for horizontal scaling is the shared-nothing architecture, in which each machine is independent and self-sufficient and none of the machines share memory or disk storage. In a shared-nothing architecture, each portion of a partitioned data store residing on a given machine is called a shard, and the partitioning process is called sharding. Horizontal scaling and sharding are in sharp contrast to the vertical scaling approach, traditionally employed in commercial DBMSs, which relies on enhancing the hardware characteristics (e.g., CPU/memory) of a single machine in order to provide scalability.

In contrast to SQL databases, NoSQL data stores typically have a simpler API (e.g., a key-value interface). Often (though not always), NoSQL stores also feature weaker consistency models than the ACID transactions of most relational DBMSs. NoSQL stores also excel in their ability to dynamically add new attributes to data records.

One way to classify NoSQL data stores is with respect to their data models. We distinguish three main categories of NoSQL data store models [44]:

- Key-value data stores (KVS). These data stores store values associated with an index (key). KVS systems typically provide replication, versioning, locking, transactions, sorting, and/or other features. The client API offers simple operations including puts, gets, deletes, and key lookups. Notable examples include Amazon Dynamo [69], Project Voldemort [248] and RIAK [200].

- Document data stores (DDS). DDS typically store more complex data than KVS, allowing for nested values and dynamic attribute definitions at runtime. Unlike KVS, DDS generally support secondary indexes and multiple types of documents (objects) per database, as well as nested documents or lists. Notable examples include Amazon SimpleDB [214], CouchDB [58], Membase/Couchbase [57] and MongoDB [176].

- Extensible record data stores (ERDS). ERDS store extensible records, where default attributes (and their families) can be defined in a schema, but new attributes can be added per record. ERDS can partition extensible records both horizontally (per row) and vertically (per column) across a data store, as well as using both partitioning approaches simultaneously. Notable examples include Google BigTable [45], HBase [87] and Cassandra [145].
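A building block shared by several of the horizontally partitioned stores reviewed below (Dynamo, Voldemort and RIAK all rely on it) is consistent hashing: nodes and keys are hashed onto the same ring, and a key belongs to the first node encountered clockwise from its hash, so adding or removing a node only remaps one segment of the ring. The sketch below is a minimal, illustrative implementation; the node names are hypothetical, and real systems additionally use virtual nodes and replication.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hashing ring: a key is assigned to the first node whose
// position on the ring is >= the key's hash (wrapping around at the end).
public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    public String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {          // first 8 digest bytes as ring position
                h = (h << 8) | (digest[i] & 0xff);
            }
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("node-a");
        ring.addNode("node-b");
        ring.addNode("node-c");   // hypothetical node names
        System.out.println(ring.nodeFor("customer:42"));
    }
}
```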

In the following we (non-exhaustively) overview specific NoSQL systems, focusing on the above-mentioned state-of-the-art examples.

Amazon Dynamo

Dynamo [69] is a key-value distributed storage system developed and used by Amazon. Dynamo is a structured overlay based on consistent hashing with at most one-hop request routing. Dynamo uses a vector clock scheme to detect update conflicts, and a write operation requires a read of the timestamps (in fact, the vector timestamps that are employed). Reading the timestamps on every write, however, can be very limiting for performance under high write throughput.

Project Voldemort

Project Voldemort [248] is an open-source distributed key-value data store used by LinkedIn for high-scalability storage. Like Dynamo, Voldemort uses consistent hashing for data partitioning and supports virtual nodes. It also supports pluggable data placement strategies to support geographically separated datacenters. Voldemort uses vector clocks to establish version ordering. Server failures and recoveries are handled automatically, and data is replicated over multiple servers. Pluggable serialization is supported to allow rich keys and values, including lists and tuples with named fields, and to integrate with common serialization frameworks like Protocol Buffers, Thrift, Avro and Java Serialization. Voldemort's single-node performance is in the range of 10-20k operations per second, depending on the machines, the network, the disk system, and the data replication factor. Voldemort provides eventual consistency, just like Amazon Dynamo.

RIAK

RIAK [200] is another open-source distributed key-value data store, developed by Basho Technologies, that provides tunable consistency. Consistency is tuned by specifying how many replicas must respond for a successful read/write operation, and can be specified per operation. Just like Dynamo and Voldemort, RIAK relies on consistent hashing for data partitioning and vector clocks for versioning. RIAK also includes a MapReduce mechanism for non-key-based querying. MapReduce jobs can be submitted through RIAK's HTTP API or the protobufs API. To this end, the client makes a request to a RIAK node, which becomes the coordinating node for the MapReduce job.

Amazon SimpleDB

SimpleDB [214] is Amazon's pay-as-you-go proprietary document data store offered as a service in Amazon's AWS cloud portfolio.

A key technical benefit of Amazon SimpleDB is automatic geo-replication. Every time a user stores a data item, multiple replicas are created in different data centers within a selected geographical region, which enables high availability and data durability in the unlikely event of a data center outage. SimpleDB also automatically indexes data to enable efficient queries. Unlike key-value data stores, SimpleDB supports more than one grouping in one database: documents are put into domains, which support multiple indexes. The SimpleDB data model comprises domains, items, attributes and values. Domains are collections of items that are described by attribute-value pairs. SimpleDB constrains individual domains to grow up to 10 GB each, and currently has a limit of 100 active domains.

CouchDB

Apache CouchDB [58] is an open source document data store that stores JSON objects consisting of named fields, without a predefined schema. Field values can be strings, numbers or dates, but also ordered lists and associative arrays. CouchDB uses JavaScript for MapReduce queries and regular HTTP as its API, and provides ACID semantics at the document level but eventual consistency otherwise. To support ACID at the document level, CouchDB implements multi-version concurrency control. CouchDB structures stored data into views; each view is constructed by a JavaScript function that acts as the map phase in MapReduce. CouchDB was designed with bi-directional replication (or synchronization) and off-line operation in mind: CouchDB can replicate to devices (like smartphones) that can go offline and later sync the device back.

Membase/Couchbase

Couchbase [57], originally known as Membase, is an open source, distributed, document-oriented data store that is optimized for interactive applications. Couchbase/Membase initially grew around memcached [171], a popular distributed in-memory key-value cache system, adding to memcached features like persistence, replication, high availability, live cluster reconfiguration, re-balancing, multi-tenancy and data partitioning. Couchbase supports fast fail-over, with replication support for both peer-to-peer and master-slave modes. Couchbase has only recently migrated from a key-value store to a document data store, with version 2.0 bringing features like a JSON document store, incremental MapReduce and cross-datacenter replication. Couchbase stores JSON objects with no predefined schema.
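CouchDB's "regular HTTP" API mentioned above means a document can be stored with nothing more than an HTTP PUT. The sketch below assumes a local CouchDB instance on its default port 5984 and a database named "meters" that already exists; the database name, document id and JSON fields are hypothetical.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: store one JSON document in CouchDB over its plain HTTP API.
// Host, port, database, document id and fields are assumptions for illustration.
public class CouchDbPut {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:5984/meters/reading-001");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");

        String json = "{\"meterId\":\"m-42\",\"kwh\":1.37,\"ts\":\"2013-03-01T12:00:00Z\"}";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        // A 201 response indicates the document was created; the body carries its revision.
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```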

MongoDB

MongoDB [176] is an open source document-oriented data store. MongoDB stores structured data as JSON-like documents with dynamic schemas. MongoDB supports queries by field, range queries and regular-expression searches. Queries can return specific fields of documents and can also include user-defined JavaScript functions. Any field in a MongoDB document can be indexed, and secondary indices are also available. MongoDB supports master-slave replication: a master can perform reads and writes, whereas a slave copies data from the master and cannot be used for writes. MongoDB scales horizontally using sharding and can run over multiple servers, balancing the load and/or duplicating data to keep the system up and running in case of hardware failure. MongoDB supplies a file storage facility, called GridFS, which takes advantage of the load balancing and data replication features over multiple machines for storing files. MapReduce can be used in MongoDB for batch processing of data and aggregation operations.

BigTable

BigTable [45] is a distributed storage system designed by Google. BigTable stores and manages petabytes of structured data across thousands of commodity servers. Google initially designed BigTable as a distributed data storage solution for several applications (like Google Earth and Google Finance), aiming at providing a flexible, high-performance solution for different application requirements. BigTable stores information as uninterpreted byte arrays, each indexed by two arbitrary string values (the row key and the column key) together with a timestamp, forming a three-dimensional mapping. While BigTable is used by database applications, it is not a typical relational database system; it is better described as a sparse, distributed, multi-dimensional sorted map.

HBase

HBase [87] is a distributed, column-oriented data storage system offering strict consistency, designed for data distributed over numerous nodes. HBase is largely inspired by Google's BigTable [45] and is designed to work well with Hadoop, an open-source implementation of Google's MapReduce [67] framework. The default distributed file system for Hadoop (HDFS) is designed for sequential reads and writes of large files in a batch manner. This strategy prevents the system from offering close-to-real-time access, which requires efficient random access to the data. HBase is an additional layer on top of HDFS that efficiently supports random reads, and random access in general, on the data, using a sparse multi-dimensional sorted map.
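The sketch below illustrates the random read/write access that HBase adds on top of HDFS, using the classic client API (HTable, Put, Get) of the HBase releases contemporary with this deliverable; the table name, column family and row key are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: one random write and one random read through the classic HBase
// client API. Table "events" with column family "d" is hypothetical.
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");
        try {
            byte[] rowKey = Bytes.toBytes("sensor42#2013-04-05T10:00");

            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("d"), Bytes.toBytes("kwh"), Bytes.toBytes("1.37"));
            table.put(put);

            Result row = table.get(new Get(rowKey));
            byte[] value = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("kwh"));
            System.out.println(Bytes.toString(value));
        } finally {
            table.close();
        }
    }
}
```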

Cassandra

Cassandra [145] is a distributed data storage system developed by Facebook which, similarly to BigTable, is designed for managing very large amounts of structured data spread out across many commodity servers, providing a key-value store with tunable consistency. The main goal of Cassandra is to provide a highly available service with no single point of failure. The Cassandra API consists of three very simple methods (insert, get, delete) and allows the user to manipulate data using a distributed multi-dimensional map indexed by the key. The different attributes (columns) of the data stored by Cassandra are grouped together into sets called column families. Cassandra exposes two kinds of such families: simple column families and super column families, where the latter is a column family within a column family. This allows a key to map to multiple values.

3.2.2 SQL (Relational) Data Stores

Several contemporary relational DBMSs provide horizontal scaling while maintaining the classical row-oriented data model. These systems try to reconcile the expressiveness of SQL with the scalability of NoSQL data stores. Here we overview some of these novel relational data stores; in addition, we note that major proprietary RDBMSs such as IBM DB2, Oracle, and Microsoft SQL Server also feature horizontal scaling.

MySQL Cluster

MySQL Cluster [53] is a write-scalable, ACID-compliant transactional database that promises five nines of availability. It builds upon MySQL by applying a shared-nothing architecture with no single point of failure. It provides auto-sharding, with automatic and transparent database partitioning across commodity nodes which may be geographically replicated. MySQL Cluster also provides multi-master replication, in which each data node can accept write operations. It uses in-memory tables and indexes in order to provide low latency, targeting real-time responsiveness. MySQL Cluster aims at bridging the gap between the SQL and NoSQL ecosystems by providing both APIs: it supports a fully structured relational data model by default, but also a key-value data model through its Memcached API.

VoltDB

VoltDB [249] is an in-memory distributed database that allows partitioning and scaling across a shared-nothing server cluster. For high availability, VoltDB uses synchronous multi-master replication; similarly to MySQL Cluster, tables are partitioned over multiple servers, and clients can query any server.

VoltDB features durability using master-slave replication over the wide area for disaster recovery. VoltDB supports data export to Hadoop, aiming at simplifying Big Data analytics. One of the potential limitations of VoltDB is its orientation towards fitting the database into memory, which may be insufficient for some Big Data applications.

Megastore and Spanner

Megastore [23] and Spanner [55] are two distributed database systems deployed internally at Google. Megastore provides fully serializable ACID semantics within fine-grained partitions of data. Megastore provides a semi-relational data model layered on top of BigTable, thus enhancing BigTable's limited API and eventual consistency model, which complicate distributed application development. As in an RDBMS, the data model in Megastore is declared in a schema, strongly typed, and then mapped to BigTable. Each Megastore schema has a set of tables, each containing a set of entities, which in turn contain a set of properties. Each write in Megastore is synchronously replicated over a wide area network.

Spanner follows Megastore in its semi-relational data model and uses a similar schema language. However, Spanner considerably improves on Megastore's performance by abandoning the BigTable legacy and relying on two main building blocks: (i) a pipelined implementation of the Paxos fault-tolerant replication protocol [146] and (ii) the Google TrueTime API, a clock synchronization framework that exposes clock uncertainty; internally, the TrueTime implementation uses GPS and atomic clocks to keep the distributed time synchronization uncertainty small (typically less than 10 ms).

3.3 Distributed File Systems

The SNIA file system taxonomy [241] classifies all non-local file systems into two groups: (i) Shared File Systems (SFS), comprising SAN (storage area network) File Systems (SAN FS) and Cluster File Systems (CFS), and (ii) Networked File Systems (NFS), comprising Distributed File Systems (DFS) and Distributed Parallel File Systems (DPFS). In practice, non-local commercial and open source file systems rarely fall exclusively into one of these categories. For example, the IBM General Parallel File System [95, 207] can be instantiated in all of the above-mentioned file system classes. Hence, here we simply call all non-local file systems distributed, while noting the difference with respect to the SNIA taxonomy. We briefly summarize the most common architectures of distributed file systems and focus on the two distributed file systems that have probably had the most impact on open-source Big Data management: GoogleFS [89] and Hadoop DFS (HDFS) [96, 212].

The most basic environment for a distributed file system is a configuration in which several servers in a cluster are directly attached to the same storage. Direct connection here means that each shared storage device is concurrently available to all DFS servers. The connection can be established using SCSI, SAN, InfiniBand, Fibre Channel, virtual and other interfaces. The main task of a DFS is to synchronize concurrent access to data. As is often the case with Big Data, a single storage unit (e.g., a SAN) rarely suffices. In this case files are often distributed to separate physical locations across several file system servers (or clusters, each of which can be further connected to a local SAN) in a way that resembles database sharding. File system clients access such a distributed file system over TCP/IP using a network protocol (e.g., NFS, CIFS, NSD). For parallel applications, an alternative variant with data striping instead of data sharding is used. Additionally, most current file systems support storage virtualization: the pooling of physical storage from multiple network storage devices into what appears to be a single storage device. The technology can be applied at different levels of a file system (e.g., block virtualization or file virtualization). There are many commercial distributed file systems that target Big Data, most of which fall into the CFS category of the SNIA taxonomy. Prominent examples include Oracle CFS, IBM GPFS, Symantec's Veritas CFS, Red Hat Global FS, VMware VMFS, Terrascale TerraFS, GlusterFS and others.

GoogleFS

The Google File System (GoogleFS) [89] is Google's proprietary scalable distributed file system for large distributed data-intensive applications. GoogleFS runs on inexpensive commodity hardware while providing fault tolerance and delivering high aggregate performance to a large number of clients. It is widely used within Google to support many distributed applications. GoogleFS provides a non-POSIX file system interface with a classical hierarchical directory/file structure. GoogleFS is designed to store a modest number of large, multi-GB files. Small files are supported, but the system is not optimized for them. The workloads are inspired by MapReduce and primarily consist of large streaming reads and small random reads. In addition, the typical write workload consists of many large, sequential writes that append data to files. As with reads, small random writes are supported but are not necessarily efficient.

A GoogleFS cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients. The master maintains all file system metadata, including the namespace, access control information, and the mapping of files to chunks, which are fixed-size (64 MB) portions of data assigned to chunkservers.

Due to the specific workloads, clients and chunkservers do not need to cache data. To provide good scalability, GoogleFS offers a relaxed consistency model and replicates data to tolerate disk and machine failures, but also rack failures. The master node is additionally replicated for availability.

HDFS

The Hadoop Distributed File System [96, 212] is the open source Hadoop file system implementation that closely follows GoogleFS. It is designed for write-once, read-many workloads, and specifically tailored to the requirements of MapReduce (Hadoop). HDFS does not handle concurrency, but allows for data replication in the vein of GoogleFS. Just like GoogleFS, HDFS optimizes throughput, not latency. In HDFS parlance, the master node is called the NameNode, whereas storage nodes (chunkservers in GoogleFS) are called DataNodes. Data replication is handled among the DataNodes themselves (similar to chain replication) and follows distance rules. HDFS workloads are batch oriented. A typical read from a client involves: (i) contacting the NameNode to determine the DataNodes where the actual data is stored, (ii) the NameNode replying with block identifiers and DataNode locations, and (iii) the client contacting the DataNode(s) to fetch the data. On the other hand, a typical write from a client involves: (i) contacting the NameNode to update the namespace and verify access control permissions, (ii) the NameNode allocating a new block on a suitable DataNode, and (iii) the client streaming directly to the selected DataNode. Currently, HDFS files are immutable. By default, HDFS supports 3-way replication with each replica stored on a different rack. Block replication benefits MapReduce, since scheduling decisions can take replicas into account and hence better exploit data locality.
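From a client's point of view, the NameNode/DataNode interaction described above is hidden behind Hadoop's FileSystem API: opening a path triggers the block lookup at the NameNode, and reading the returned stream pulls data from the DataNodes. A minimal read sketch follows, assuming a hypothetical path and a cluster configuration picked up from the usual core-site.xml/hdfs-site.xml files.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: read a file from HDFS. FileSystem.open() contacts the NameNode for
// block locations, and the returned stream fetches data from the DataNodes.
// The path is hypothetical.
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/bigfoot/logs/2013-04-05.log");
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```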

4 Big Data: Computational Frameworks

The BigFoot project covers a batch analytics engine meant to perform bulk data analysis, and an interactive query engine for selective, low-latency queries. This section reviews related work for the two engines.

4.1 Batch Processing

Batch processes are eminently meant to process large amounts of raw data, in order to produce results that can afterwards be analyzed efficiently by the interactive query engine. MapReduce is currently the most widely used programming paradigm for this kind of application: in the following, we review MapReduce engines, high-level languages deployed on top of MapReduce, and optimization efforts, and we discuss some alternative programming models.

4.1.1 MapReduce

The MapReduce [67] paradigm was originally developed at Google. The approach takes inspiration from functional programming, and is based on a map phase where local data is processed and transformed into a series of (key, value) pairs. Those pairs are then shuffled across the data center to reducers, each responsible for a set of keys, in order to produce aggregated results. By virtue of having no centralized bottleneck, software that adopts the MapReduce solution can scale to thousands of machines; this allows scaling a cluster horizontally, i.e., by adding more machines. The success of this approach can be largely attributed to the fact that the alternative of scaling vertically, by using more powerful machines, is generally much more expensive and in several cases unfeasible.

Hadoop

Apache Hadoop is an open-source framework written in Java that implements the MapReduce programming model discussed above. Rather than relying on hardware to deliver high availability, Hadoop itself detects and handles failures at the application layer, delivering a highly available service on top of a cluster of commodity machines. In addition to providing the execution engine for MapReduce jobs, Hadoop provides the Hadoop Distributed File System (HDFS) (see Section 3.3): a scalable file system designed for typical MapReduce workloads, optimized for the throughput of reading and writing large files and providing data locality awareness to the execution engine.
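To make the map/shuffle/reduce flow concrete, the canonical word-count example is sketched below against Hadoop's org.apache.hadoop.mapreduce API: the mapper emits (word, 1) pairs, the shuffle groups them by word, and the reducer sums the counts. Job-driver boilerplate (input/output paths, job submission) is omitted.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Canonical word count: map emits (word, 1); reduce sums the counts per word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```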

The Hadoop framework is easy to extend and is the basis of a large software ecosystem, including, among the products discussed in this document: Hive (Section 4.1.2); HBase (Section 3.2.1); Cascading (Section 4.1.2), in turn used by Cascalog, Scalding, etc.; Apache Pig (Section 4.1.2); Cloudera's Distribution of Hadoop (Section 6.1.1); Hortonworks Data Platform (Section 6.1.2); Teradata Aster (Section 6.2); Oracle Big Data Solution (Section 6.2); SAS (Section 6.2); and HP Vertica (Section 6.2).

Spark

Spark [262] is an in-memory data analysis system with a MapReduce programming model, written in Scala. Spark is based on Resilient Distributed Datasets (RDDs), fault-tolerant data structures for cluster computing. RDDs are immutable, partitioned collections of objects that support a wide range of transformations. RDDs allow applications to keep working sets in memory for efficient reuse (caching). By virtue of working in memory, Spark is particularly efficient for iterative algorithms and interactive data mining.

4.1.2 Libraries and Query Languages

MapReduce has proven to be a good abstraction for developing massively scalable algorithms; however, programming in MapReduce requires working at a low level of abstraction with respect to the high-level goals, and dealing with several low-level system details and configuration parameters. Here, we review a few high-level libraries and languages that successfully overcome these limitations.
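Before turning to these libraries, the sketch below illustrates the RDD caching behaviour described for Spark above. It assumes Spark's Java bindings (SparkConf, JavaSparkContext, JavaRDD) with Java 8 lambdas, and a hypothetical input path; it is an illustration, not code from the project.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: an RDD is loaded, cached in memory, and reused by two actions
// without re-reading the input from disk. The input path is hypothetical.
public class SparkCacheExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> events = sc.textFile("hdfs:///bigfoot/security/events.log").cache();

        long total = events.count();                                          // first pass reads the file
        long alerts = events.filter(line -> line.contains("ALERT")).count();  // served from the cache

        System.out.println(total + " events, " + alerts + " alerts");
        sc.stop();
    }
}
```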

Cascading

Cascading is a Java application framework for easily developing rich data analytics and data management applications on top of Hadoop. The framework provides a high-level API for defining complex data flows while hiding the complexity of the underlying system. The flows are optimized and compiled into Hadoop jobs. Cascading is developed by Concurrent, Inc.

The Cascading data flow API is suitable for creating Domain-Specific Languages (DSLs) for languages that compile to JVM bytecode. This kind of DSL lets the programmer define data flows in another language instead of Java. There are DSLs for four different languages: a Clojure DSL called Cascalog, a Ruby DSL called Cascading.JRuby, a Jython DSL called PyCascading and a Scala DSL called Scalding. Cascalog, PyCascading and Scalding are developed and maintained by Twitter, while Cascading.JRuby is developed by Etsy.

Pig

Apache Pig [184] is a platform for analyzing large data sets, consisting of Pig Latin, a dataflow-oriented high-level language for data analysis, and an infrastructure for evaluating Pig Latin programs. Pig Latin offers a declarative syntax similar to SQL, and it can be extended with User Defined Functions (UDFs) written in Java. The Pig compiler compiles to Hadoop MapReduce jobs and features a rule-based optimizer.

Hive

Hadoop Hive [238] is a data warehouse system for Hadoop that provides a mechanism to project structure onto raw data and query it using a SQL-like language called HiveQL, which can be easily extended with UDFs. The main difference between HiveQL and Pig Latin is that while Pig Latin is a new language with operators similar to SQL, HiveQL is syntactically almost identical to SQL. In this way, HiveQL can be used by anybody who knows SQL. HiveQL allows the user to resort to UDFs for queries that cannot be easily expressed in HiveQL; queries are compiled into Hadoop jobs, but they can also be executed on the Spark platform using Shark [81].

SCOPE

SCOPE [266] is a language developed by Microsoft Research, with a declarative syntax reminiscent of SQL and Hive. SCOPE is integrated and extensible within the .NET platform and adopts an ad-hoc execution engine based on MapReduce; SCOPE features a cost-based optimizer, and it can perform optimizations such as early projection and filtering to minimize the amount of data shuffled between machines [101].
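As an illustration of how close HiveQL is to SQL, the sketch below submits an aggregation query from Java through the HiveServer2 JDBC driver; Hive compiles the query into MapReduce jobs behind the scenes. The server address, table and column names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: a HiveQL aggregation submitted through the HiveServer2 JDBC driver.
// Server address, table and column names are hypothetical.
public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-server:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT src_country, COUNT(*) AS hits "
                   + "FROM attack_events GROUP BY src_country "
                   + "ORDER BY hits DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("src_country") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```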

4.1.3 Optimizing Batch Processes

BigFoot undertakes the endeavor of optimizing batch processing. One way to do so is by writing ad-hoc optimized libraries for widely used features, such as the ones developed at Twitter to perform machine learning in Hadoop and Pig [159]. Another possibility is to perform optimizations that are transparent to users, similarly to what compilers traditionally do for high-level programming languages.

Scheduling

The default scheduler in Hadoop is a simple FIFO scheduler: jobs are executed in the order they are submitted. This is known to be a problem, since typical workloads of Hadoop clusters contain both heavy jobs that take a long time to complete and smaller ones that should be executed quickly [48, 199]. Another scheduler, called FAIR, attempts to solve this problem by sharing computing resources between all jobs in the queue. Several works focus on optimizing the Hadoop scheduler, aiming at fairness [204, 121, 90, 112], implementing quality of service for jobs [203], scheduling jobs to meet user-provided deadlines [136], improving data locality [261], and mitigating the delays caused by straggler machines that take longer to complete their tasks [227, 228, 143, 197]. Moreover, Flex [256] is a proprietary solution that optimizes scheduling according to user-specified performance metrics. The literature on scheduling shows that size-based scheduling disciplines can obtain better results, in terms of both sojourn times and fairness, than the processor-sharing discipline that inspires FAIR [86]. Such approaches require knowing a priori the size of the jobs being submitted; this is not possible in MapReduce, but several recent works [245, 244, 5, 192, 239] provide estimates of the running time as jobs are submitted.

Work Sharing

In database systems, multi-query optimization aims at recognizing common pieces of work between queries submitted at the same time, in order to execute them only once [209, 201, 113]. Staged database systems [106] can efficiently handle parallel databases and can perform work sharing by serving several workflows with a single stage of a query; there are cases, however, in which this may harm the running time of queries, since multicasting the results to several workflows may create a bottleneck [131].

Work sharing can also be performed in batch processing frameworks: horizontal packing can be obtained by merging jobs that run on the same input file, so that the input is scanned only once. Better opportunities for sharing can be obtained by prioritizing scans of seldom-accessed files, since frequently accessed ("hot") files give better opportunities for sharing when more jobs are allowed to accumulate on them [7]. Horizontal packing should be employed with care, as it leads to bigger jobs that may, for example, not fit in memory and create bottleneck issues similar in principle to the ones discussed above for database systems [183].
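As a purely illustrative sketch of the horizontal packing idea (not the mechanism of Stubby or any other specific system; the job representation, file paths and records below are made up), merging jobs that share an input amounts to grouping them by input path and fusing their map functions so that each record is read only once:

```python
from collections import defaultdict

def pack_horizontally(jobs):
    """Group (input_path, map_fn) jobs by input path; return one merged job per input."""
    by_input = defaultdict(list)
    for input_path, map_fn in jobs:
        by_input[input_path].append(map_fn)

    def make_merged_mapper(map_fns):
        # Each record is passed to every packed mapper; outputs are tagged with
        # the index of the originating job so results can be split downstream.
        def merged(record):
            return [(i, out) for i, fn in enumerate(map_fns) for out in fn(record)]
        return merged

    return {path: make_merged_mapper(fns) for path, fns in by_input.items()}

# Toy usage: two "jobs" over the same log file share a single scan.
jobs = [("logs/2013-04-01", lambda r: [("lines", 1)]),
        ("logs/2013-04-01", lambda r: [(r.split()[0], 1)] if r.split() else [])]
merged = pack_horizontally(jobs)
for record in ["10.0.0.1 GET /index", "10.0.0.2 GET /about"]:
    print(merged["logs/2013-04-01"](record))
```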

Jobs written in MapReduce only lend themselves to a limited amount of work sharing, as Map and Reduce classes are effectively black boxes whose operations are unknown. Conversely, high-level languages provide more information about data flows, and lend themselves better to optimization: Stubby [156] is a cost-based optimizer for MapReduce jobs that can perform vertical packing (i.e., collapsing sequential MapReduce jobs into a single one) in addition to the horizontal packing described before; it is based on annotations that can be generated by the execution engines of high-level languages such as Pig. Also integrated with Pig, ReStore [79] speculatively caches the results of partial computations in order to reuse them in subsequent jobs.

4.1.4 Special-Purpose Frameworks

While MapReduce is by far the most widespread programming paradigm for big data batch analysis, it is arguable that there are cases for which it is not ideal. For example, a naive implementation of iterative graph algorithms requires shuffling the full graph structure across the network at each iteration, resulting in significant communication overheads. Pregel [167] and GraphLab [93, 144] provide a different model, based on the "think like a node" abstraction, whereby messages are passed between nodes along graph edges at each iteration, according to the Bulk Synchronous Parallel paradigm; Bagel is a library that provides access to the same abstraction on top of Spark. These execution engines provide both a good abstraction and efficient performance for graph-based algorithms; however, there may be cases in which existing MapReduce solutions are good enough, given that there are tricks that can improve performance well beyond what naive implementations achieve. As Jimmy Lin argues [157], "considered in isolation it naturally makes sense to choose the best tool for the job, but this neglects the fact that there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc. The alternative is to use a common computing platform that's already widely adopted (in this case, Hadoop), even if it isn't a perfect fit for some of the problems." In the BigFoot project, we will consider this tension, targeting solutions that fit the job well without requiring the inter-operation of too many disparate solutions.
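To make the vertex-centric, superstep-based model concrete, the following toy sketch (plain Python, not the actual Pregel, GraphLab or Bagel API) computes single-source shortest paths by exchanging messages along graph edges at each superstep; the graph at the bottom is made up for illustration.

```python
# Toy illustration of the vertex-centric, Bulk Synchronous Parallel model:
# each superstep, every vertex processes the messages received in the previous
# superstep and may send new messages to its neighbours; computation stops
# when no messages are left.

def shortest_paths(edges, source):
    # edges: {vertex: [(neighbour, weight), ...]}
    dist = {v: float("inf") for v in edges}
    dist[source] = 0
    inbox = {source: [0]}                      # messages delivered to each vertex

    while inbox:                               # one iteration = one superstep
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best <= dist[vertex]:           # vertex program: keep the best offer
                dist[vertex] = best
                for neighbour, weight in edges[vertex]:
                    outbox.setdefault(neighbour, []).append(best + weight)
        # barrier: messages only become visible at the next superstep, and
        # vertices with no improving message stay inactive.
        inbox = {v: msgs for v, msgs in outbox.items() if min(msgs) < dist[v]}
    return dist

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}
print(shortest_paths(graph, "a"))   # {'a': 0, 'b': 1, 'c': 2}
```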

4.2 Interactive Query Processing

After MapReduce, it comes as no surprise that another Google system for analyzing big data appeared: Dremel. Dremel, first introduced in [170], is a scalable, interactive ad-hoc query system for the analysis of read-only nested data. It comes from the need to support interactive analysis of large amounts of data, while MapReduce focuses on batch jobs and on maximizing throughput. It is important to highlight that Dremel is not intended as an alternative to MapReduce; rather, it is complementary to it and, according to its authors, is often used in conjunction with it to analyze the outputs of MapReduce pipelines. Along with Dremel, a series of open source projects started, and, just like Apache Hadoop for MapReduce, Apache Drill and Impala (github.com/cloudera/impala) were launched. Similarly to MapReduce, Dremel and its counterparts provide fault-tolerant execution and in situ data processing capabilities.

In Dremel, data is stored in a semi-structured format using Google's Protocol Buffers, an open source project that supports a language- and platform-agnostic system for serializing structured data. The data model is based on strongly-typed nested records. Dremel moreover uses a columnar data layout: all the values of a given field are stored consecutively, to improve retrieval efficiency. The elements of the Protocol Buffers schema are stored in their own file, in a column-based way. The last component still missing in the Dremel description is its query language. Dremel uses its own implementation of a SQL-like language, designed to efficiently support nested storage. The execution of queries is managed by a distributed execution engine that computes an execution plan based on a multi-level serving tree: a root server receives incoming queries and, using metadata, routes them to the lower levels of the tree. The leaf servers communicate directly with the physical storage. Therefore, each leaf server executes a portion of the query, and higher-level servers aggregate the results.

The most important open source projects modeled after Google's Dremel paper are Apache Drill and Impala, the latter being the more mature one. Architecturally, the two projects are very similar; the main differences lie in the fact that Impala was developed by Cloudera and is a stable and mature project, while Apache Drill is a community effort which is still in a preliminary stage. Notably, the Impala engine reuses a large portion of the Apache Hive project (cf. Section 4.1.2), in particular the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax).
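The effect of the columnar layout described above can be illustrated with a small sketch (a deliberate simplification: Dremel's actual format also stores repetition and definition levels to reconstruct nested records, which are ignored here, and the records below are made up):

```python
# Illustrative contrast between row-oriented and column-oriented layouts.
records = [
    {"url": "/a", "country": "IT", "latency_ms": 12},
    {"url": "/b", "country": "FR", "latency_ms": 31},
    {"url": "/a", "country": "IT", "latency_ms": 18},
]

# Row layout: whole records are stored consecutively.
row_store = [tuple(r.values()) for r in records]

# Columnar layout: all values of each field are stored consecutively, so a
# query touching one field only reads that column.
column_store = {field: [r[field] for r in records] for field in records[0]}

# A query such as "SELECT avg(latency_ms)" only needs one column here.
latencies = column_store["latency_ms"]
print(sum(latencies) / float(len(latencies)))   # 20.333...
```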

4.3 Stream Processing

In MapReduce, input data sets are viewed as immutable pieces of data stored as files on HDFS. This paradigm falls short when data arrives in the form of a stream; the following solutions tackle this task.

S4

Developed by Yahoo!, S4 [180] is a distributed system for stream processing written in Java. S4 is decentralized and scalable, with no limit on the number of nodes that can be added, and it has built-in automatic load balancing to avoid scalability problems. It is fault-tolerant: when a server fails, another one is ready to take over its task, and check-pointing is supported to minimize state loss.

Storm

Storm is a distributed real-time computation system developed by Twitter. It scales to massive numbers of messages per second by adding machines and increasing the parallelism settings of the topology. The key feature of Storm compared to other real-time computation systems is that it guarantees that every message will be processed. It is fault-tolerant: in case of failure, tasks are reassigned. One key property of Storm is that it is programming-language agnostic and can be used with virtually any language and framework. In Storm, the user specifies the topology and a mapping into tasks.
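The topology style of these systems, a source of tuples feeding a chain of processing elements, can be sketched in plain Python as follows (an illustration only, not the actual S4 or Storm API; the input feed and reporting interval are made up):

```python
from collections import Counter

# Spout/bolt-style pipeline: a spout emits an unbounded sequence of tuples,
# and bolts are chained generators that transform or aggregate them.

def sentence_spout(sentences):
    for sentence in sentences:               # in a real system: an endless feed
        yield sentence

def split_bolt(stream):
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream, report_every=5):
    counts = Counter()
    for i, word in enumerate(stream, 1):
        counts[word] += 1
        if i % report_every == 0:            # emit partial counts as the stream flows
            yield dict(counts)

feed = ["big data streams", "big data analytics", "streams of big data"]
for snapshot in count_bolt(split_bolt(sentence_spout(feed)), report_every=4):
    print(snapshot)
```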

5 Big Data: Analysis and Applications

5.1 Scalable Machine Learning and Data Mining

Machine learning [173, 37] and data mining [109, 104] have been hot research topics for decades. In the big data context, one of the bottlenecks for the successful inference of useful information from the data is the computational complexity of machine learning algorithms [37]. Most state-of-the-art non-parametric machine learning algorithms have a computational complexity of either O(N^2) or O(N^3), where N is the number of training examples [198]. Data mining [109, 104], which uses many machine learning methods while focusing on the discovery of previously unknown properties of the data, suffers from similar computational complexity issues in the big data context (the distinction between machine learning and data mining [1] is a controversial topic, which is outside the focus of this document). In recent years there have been increasing efforts to devise scalable machine learning and data mining techniques [160, 158, 182, 139, 198] so as to gain insight from big data sets; a literature review of some of these research efforts can be found in [196]. In this section, we systematically review machine learning algorithms and data mining techniques that are connected to the BigFoot project, with a focus on clustering algorithms [123].

Machine Learning

Machine learning [37] is a subfield of Artificial Intelligence (AI) concerned with algorithms that allow computers to learn from data, with a focus on prediction and inference. Existing algorithms can be classified into three categories: supervised, unsupervised and reinforcement learning. Implementations of scalable machine learning algorithms are discussed later in this section.

Supervised learning

Supervised learning [141] is the machine learning task of inferring a function from labeled training data. A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier (if the output is discrete) or a regression function (if the output is continuous). The inferred function should predict the correct output value for any valid unseen input object. On the basis of their different characteristics and techniques, supervised learning algorithms can be categorized as logic-based algorithms, perceptron-based algorithms, statistical learning algorithms, instance-based learning algorithms, and support vector machines; they are briefly reviewed in this section.

1. Logic-based algorithms

Decision tree learning
Decision tree learning [194] builds a decision tree that maps observations about an item (its features) to conclusions about the item's target value.

Inductive logic programming
Inductive logic programming (ILP) [178] is an approach to rule learning that uses logic programming as a uniform representation for examples, background knowledge, and hypotheses. Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an ILP system will derive a hypothesized logic program which entails all the positive and none of the negative examples.

2. Perceptron-based techniques

Artificial neural networks
An artificial neural network (ANN) [36, 102, 16], usually called simply a neural network (NN), is a learning algorithm inspired by the structure and functional aspects of biological neural networks. Computations are structured in terms of an interconnected group of artificial neurons, processing information using a connectionist approach to computation. Modern neural networks are non-linear statistical data modeling tools. They are usually used to model complex relationships between inputs and outputs, to find patterns in data, or to capture the statistical structure of an unknown joint probability distribution between observed variables.

3. Instance-based learning

Similarity and metric learning
Similarity and metric learning algorithms [257] aim to learn a similarity function (or a distance metric) given pairs of examples that are considered similar and pairs of less similar objects. The learned function can then predict whether new objects are similar. This kind of learning algorithm is sometimes used in recommendation systems.

Sparse dictionary learning
The basic idea of sparse dictionary learning is the approximation of a signal vector by a linear combination of components under certain constraints. In this method

[142], a datum is represented as a linear combination of basis functions, and the coefficients are assumed to be sparse. Let x be a d-dimensional datum and D a d-by-n matrix, where each column of D represents a basis function and r is the coefficient vector used to represent x with D. Mathematically, sparse dictionary learning amounts to finding D and r such that x \approx D r, where r is sparse. Generally speaking, n is assumed to be larger than d to allow the freedom for a sparse representation. Sparse dictionary learning has been applied in several contexts. In classification, the problem is to determine which class a previously unseen datum belongs to: supposing a dictionary for each class has already been built, a new datum is associated with the class whose dictionary gives the best sparse representation of it. Sparse dictionary learning has also been applied to image de-noising; the key idea is that a clean image patch can be sparsely represented by an image dictionary, but the noise cannot.

4. Statistical learning algorithms

Naive Bayes classifiers
The Naive Bayes classifier [128, 154] is based on Bayes' theorem and the maximum a posteriori hypothesis. The classifier predicts class membership probabilities, such as the probability that a given sample belongs to a particular class. It is particularly suitable for high-dimensional data due to its naive assumption of class-conditional independence. Naive Bayes can be modeled in several different ways, including with normal, log-normal, gamma and Poisson density functions.

Bayesian networks
A Bayesian network [110], belief network or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independencies via a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms: given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Efficient algorithms exist that perform inference and learning [49].

5. Support vector machines
Support vector machines (SVMs) [56, 60] are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.
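As a minimal illustration of this train-then-predict workflow (a sketch using the scikit-learn library; the toy arrays below are made up for the example), a classifier is fitted on labeled examples and then queried on unseen inputs:

```python
from sklearn import svm

# Toy training set: two numeric features per example, two classes (0 and 1).
X_train = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y_train = [0, 0, 1, 1]

# Fit an SVM classifier on the labeled examples...
classifier = svm.SVC(kernel="linear")
classifier.fit(X_train, y_train)

# ...and predict the category of previously unseen inputs.
print(classifier.predict([[0.1, 0.2], [0.8, 0.9]]))   # expected: [0 1]
```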

Unsupervised learning

Unsupervised learning refers to the problem of trying to find a model that fits unlabeled data. Since the examples given to the learner have no labels, there is no error or reward signal to evaluate a potential solution. Approaches to unsupervised machine learning include:

1. Association rule learning
Association rule learning [8, 9] is a method for discovering relations between features of the items. It is generally used to find interesting relations between variables in large databases.

2. Representation learning
Classical examples include principal components analysis [255] and cluster analysis [123, 73]. Representation learning algorithms often attempt to preserve the information in their input while transforming it in a way that makes it useful, often as a preprocessing step before performing classification or prediction. They allow the reconstruction of inputs coming from the unknown data-generating distribution, while not being necessarily faithful for configurations that are implausible under that distribution. Manifold learning algorithms [108, 31] attempt to do so under the constraint that the learned representation is low-dimensional. Sparse coding algorithms [149, 258] attempt to do so under the constraint that the learned representation is sparse (has many zeros). Multi-linear subspace learning algorithms [242] aim to learn low-dimensional representations directly from tensor representations of multidimensional data, without reshaping them into (high-dimensional) vectors. Deep learning algorithms discover multiple levels of representation, or a hierarchy of features, with higher-level, more abstract features defined in terms of (or generating) lower-level features. It has been argued that an intelligent machine is one that learns a representation that disentangles the underlying factors of variation that explain the observed data.

3. SOM and ART
Neural networks are generally supervised algorithms, but some models, such as the self-organizing map (SOM) [140] and adaptive resonance theory (ART) [42], can be used as unsupervised learning algorithms. The SOM is a topographic organization in which nearby locations in the map represent inputs with similar properties. The ART model allows the number of clusters to vary with problem size and lets the user control the degree of similarity between members

of the same clusters by means of a user-defined constant called the vigilance parameter. ART networks are also used for many pattern recognition tasks, such as automatic target recognition and seismic signal processing.

Reinforcement Learning

Reinforcement learning [134, 222] is concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states. Reinforcement learning differs from the supervised learning problem in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected.

Search and Optimization

Genetic programming (GP) [24] is an evolutionary algorithm-based methodology, inspired by biological evolution, to find computer programs that perform a user-defined task. It is a specialization of genetic algorithms (GA) [92] where each individual is a computer program. It is a machine learning technique used to optimize a population of computer programs according to a fitness landscape determined by a program's ability to perform a given computational task.

Scalable Machine Learning

There are various efforts to make machine learning scale to large data sets, approached from different perspectives. Nickel et al. [182] applied RESCAL [181], a tensor factorization for relational learning, to various machine learning tasks (e.g., prediction of unknown triples, retrieval of similar entities, collective learning) on the Semantic Web's Linked Open Data, and showed that the presented approach can scale to large knowledge bases. Khuc et al. [139] built a large-scale distributed system for real-time Twitter sentiment analysis, comprising a lexicon builder and a sentiment classifier on top of Hadoop and HBase, achieving high classification accuracy and scaling well with the data size. Large graph clustering [158, 160] is also an interesting scalable machine learning task and will be covered in a later section.

Clustering

Cluster analysis, or clustering [123, 124], is the task of assigning a set of objects to groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. Clustering does not consist of one specific algorithm,

but is the general task to be solved. There exist various algorithms that differ significantly in their notion of what constitutes a cluster and in how to find clusters efficiently. Popular notions of clusters include groups with low distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis is thus a process of unsupervised classification, but it is not an automatic task.

There exists a plethora of clustering algorithms, and different categorizations. A first categorization makes a distinction between partition and hierarchical clustering. Partition techniques aim at finding the most effective partition by optimizing a criterion (e.g., minimizing the sum of squared distances within each cluster). They usually assign each object to exactly one cluster, but some variants accept outliers which do not belong to any cluster, or objects that belong to more than one cluster. Hierarchical clustering methods produce a nested series of partitions in which the decision to merge objects or clusters at each level is based on a linkage method (e.g., the smallest or largest distance between objects) and a criterion. Another categorization exists, based on the definition of cluster used by each algorithm. This categorization distinguishes hierarchical clustering, centroid-based clustering, distribution-based clustering and density-based clustering.

1. Hierarchical clustering
Hierarchical clustering, also known as connectivity-based clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away, as shown in Figure 4. As such, these algorithms connect objects to form clusters based on their distance. A cluster can be described largely by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will form, which can be represented using a tree diagram; this explains where the name hierarchical clustering comes from: these algorithms do not provide a single partitioning of the data set, but instead provide a hierarchy of clusters that merge with each other at certain distances. In this tree diagram, called a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don't mix. Connectivity-based clustering is a whole family of methods that differ by the way distances are computed. Apart from the usual choice of

distance functions, the user also needs to decide on the linkage criterion to use (since a cluster consists of multiple objects, there are multiple candidates for computing the distance). Popular choices are: single-linkage clustering [94], which uses the minimum of the object distances; complete-linkage clustering [179], which uses the maximum of the object distances; and UPGMA (Unweighted Pair Group Method with Arithmetic Mean) [215], a simple method generally used in bio-informatics, in which at each step the nearest two clusters are combined into a higher-level cluster and the distance between any two clusters A and B is taken to be the average of all distances between pairs of objects in A and in B, that is, the mean distance between elements of the two clusters. A short usage sketch of single-linkage clustering, together with a density-based method, is given after this categorization. Furthermore, hierarchical clustering can be computed in an agglomerative fashion (starting with single elements and aggregating them into clusters) or in a divisive fashion (starting with the complete data set and dividing it into partitions). While these methods are fairly easy to understand, the results are not always easy to interpret, as they do not produce a unique partitioning of the data set, but a hierarchy from which the user still needs to choose appropriate clusters. The methods are usually not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge (known as the chaining phenomenon, in particular with single-linkage clustering). In the general case, the complexity is O(n^3), which makes these methods too slow for large data sets. For some special cases, optimal efficient methods (of complexity O(n^2)) exist (see scalable clustering).

2. Centroid-based clustering
In centroid-based clustering, clusters are represented by a central point, which is not necessarily a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized (this distance is also called the distortion of the clusters). The optimization problem itself is known to be NP-hard, and thus the common approach is to search only for approximate solutions.

Figure 4: Hierarchical clustering of genes involved in leukemia. Genes are grouped together in a tree structure based on the distance between their expression patterns.

A particularly well known approximate method is Lloyd's algorithm, often actually referred to as the k-means algorithm. It does, however, only find a local optimum, and it is commonly run multiple times with different random initializations. A lot of variations of the k-means algorithm exist (see below). Most of them require the number of clusters k to be specified in advance, which is considered to be one of the biggest drawbacks of these algorithms. Furthermore, these algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid. This often leads to incorrectly cut borders between clusters.

3. Distribution-based clustering
The clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects most likely belonging to the same distribution. A nice property of this approach is that it closely resembles the way artificial data sets are generated: by sampling random objects

from a distribution. While the theoretical foundation of these methods is excellent, they suffer from one key problem known as over-fitting, unless constraints are put on the model complexity. A more complex model will usually be able to explain the data better, which makes choosing the appropriate model complexity inherently difficult. The most prominent method is known as the expectation-maximization algorithm (EM clustering [177]). Here, the data set is usually modeled with a fixed (to avoid over-fitting) number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to better fit the data set. This will converge to a local optimum, so multiple runs may produce different results. In order to obtain a hard clustering, objects are often then assigned to the Gaussian distribution they most likely belong to; for soft clusterings this is not necessary. Distribution-based clustering is a semantically strong method, as it not only provides clusters, but also produces complex models for the clusters that can capture correlations and dependencies between attributes. However, these algorithms put an extra burden on the user, who has to choose the appropriate data model(s) to optimize. Furthermore, for many real data sets there may be no mathematical model available that the algorithm is able to optimize.

4. Density-based clustering
In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that separate clusters are usually considered to be noise or border points. The most popular density-based clustering method is DBSCAN [83] (a short usage sketch follows this categorization). In contrast to many newer methods, it features a well-defined cluster model called density-reachability. Similar to linkage-based clustering, it is based on connecting points within certain distance thresholds. However, it only connects points that satisfy a density criterion, in the original variant defined as a minimum number of other objects within this radius. A cluster consists of all density-connected objects (which can form a cluster of an arbitrary shape, in contrast to many other methods) plus all objects that are within these objects' range. Another interesting property of DBSCAN is that its complexity is fairly low and that it will discover essentially the same results (it is deterministic for core and noise points, but not for border points) in each run; therefore there is no need to run it multiple times. OPTICS [17] is a generalization of DBSCAN that removes the need to choose an appropriate value for the distance threshold, and produces a hierarchical result related to that of linkage clustering. The key drawback of DBSCAN and OPTICS is that they expect some kind of density drop in order to detect cluster borders. On data sets with overlapping Gaussian distributions (a common use case for artificial data), the cluster borders produced by these algorithms will often look arbitrary, because the cluster density decreases continuously. On mixtures of Gaussians, they will almost always be outperformed by methods such as EM clustering, which are able to precisely model this kind of data.
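To make the linkage criteria and density parameters discussed in this categorization concrete, the following sketch clusters the same toy points with single-linkage hierarchical clustering (via SciPy) and with DBSCAN (via scikit-learn); the data and parameter values are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import DBSCAN

# Toy 2-D data: two dense groups plus one isolated point.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
                   [9.0, 0.0]])

# Hierarchical (connectivity-based) clustering with the single-linkage
# criterion; cutting the dendrogram at distance 1.0 yields flat clusters.
dendrogram = linkage(points, method="single")
hier_labels = fcluster(dendrogram, t=1.0, criterion="distance")

# Density-based clustering: a point needs at least min_samples points within
# radius eps to become a core point; sparse points are labeled -1 (noise).
dbscan_labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)

print("single-linkage:", hier_labels)    # e.g. [1 1 1 2 2 2 3]
print("DBSCAN:        ", dbscan_labels)  # e.g. [0 0 0 1 1 1 -1]
```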

K-means and Lloyd's algorithm

In this section, k-means and Lloyd's algorithm [163] are reviewed in detail due to their wide applicability. K-means is the best known representative of centroid-based clustering: it aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells. The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum.

Given a set of observations (x_1, x_2, ..., x_n), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n), S = {S_1, S_2, ..., S_k}, so as to minimize the within-cluster sum of squares (WCSS), i.e., to find

\arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \left\| x_j - \mu_i \right\|^2    (1)

where \mu_i is the mean of the points in S_i.

The standard algorithm for k-means clustering uses an iterative refinement technique. It was first proposed by Stuart Lloyd in 1957, and it is therefore also referred to as Lloyd's algorithm. The algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic. It then repeats the next two steps until convergence, which is reached when points no longer switch clusters (or, alternatively, when the centroids no longer change):

1. Calculate the average point, or centroid, of each set via some metric (usually by averaging dimensions in Euclidean space);
2. Construct a new partition by associating each point with the closest centroid, usually using the Euclidean distance function.

Although relatively simple, the algorithm performs data summarization (the clusters are represented by their centers), and it is therefore considered a scalable approach. Commonly used initialization methods are Forgy and Random Partition. The Forgy method randomly chooses k observations from the data set and uses these as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then proceeds to step 1, thus computing the initial means as the centroids of the clusters' randomly assigned points. The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the data set. According to Hamerly et al. [103], the Random Partition method is generally preferable for algorithms such as k-harmonic means and fuzzy k-means (see below), while for expectation-maximization and the standard k-means algorithm, the Forgy method of initialization is preferable. Another initialization method consists of using canopy clustering, discussed below, as a first step for k-means.

Since the algorithm converges slowly and, due to limitations in numerical precision, will often not converge exactly, real-world applications of Lloyd's algorithm usually stop once the clustering is good enough. One common termination criterion is when the maximum distance a point moves in one iteration drops below some set limit. The computation time of k-means is O(kni), where k is the number of clusters, n is the number of points and i is the number of iterations.

K-means is a relatively simple and efficient clustering algorithm, but it has some drawbacks. The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results, which is why, when performing k-means, it is important to run validation checks to determine the number of clusters in the data set. K-means also tends to find clusters of comparable spatial extent (as in Figure 5). Finally, as it is a heuristic algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions. However, in the worst case, k-means can be very slow to converge.
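The iterative refinement described above can be summarized in a few lines of code; the following is a minimal NumPy sketch of Lloyd's algorithm with Forgy initialization (toy data, no production concerns such as smarter seeding or restarts).

```python
import numpy as np

def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Minimal Lloyd's algorithm: Forgy initialization, then alternate the two steps."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]  # Forgy init

    for _ in range(max_iter):
        # Assignment step: associate each point with the closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Update step: recompute each centroid as the mean of its points;
        # an empty cluster keeps its previous centroid in this simple sketch.
        new_centroids = np.array([points[labels == i].mean(axis=0)
                                  if np.any(labels == i) else centroids[i]
                                  for i in range(k)])

        if np.allclose(new_centroids, centroids):   # convergence: centroids stop moving
            break
        centroids = new_centroids
    return centroids, labels

data = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                 [5.0, 5.0], [5.2, 5.1], [4.9, 5.3]])
centers, assignment = lloyd_kmeans(data, k=2)
print(centers)
print(assignment)   # two groups of three points each
```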

Figure 5: Example of k-means producing comparably-sized clusters.

A lot of variants exist that try to mitigate these limitations:

Fuzzy c-means clustering (FCM), proposed by Dunn [76] and later improved by Bezdek [33], is a soft version of k-means. In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Thus, points on the edge of a cluster may belong to it to a lesser degree than points in the center of the cluster. With fuzzy c-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster. The degree of belonging, w_k(x), is inversely related to the distance from x to the cluster center as calculated on the previous pass. It also depends on a parameter m that controls how much weight is given to the closest center.

Gaussian mixture models trained with the expectation-maximization algorithm (EM algorithm) [193] maintain probabilistic assignments to clusters, instead of deterministic assignments, and multivariate Gaussian distributions instead of means.
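As an illustration of the soft assignments produced by such model-based variants, the following sketch fits a two-component Gaussian mixture with scikit-learn and prints the per-component membership probabilities (class and parameter names as in recent scikit-learn versions; the toy data is made up for illustration).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two overlapping groups of points in the plane.
data = np.array([[0.0, 0.2], [0.3, 0.1], [0.2, 0.4],
                 [2.0, 2.1], [2.3, 1.9], [1.8, 2.2]])

# Fit a mixture of two multivariate Gaussians with the EM algorithm.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

# Soft clustering: each point gets a probability of belonging to each
# component, in contrast with the hard assignments of standard k-means.
memberships = gmm.predict_proba(data)
print(np.round(memberships, 3))
```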


More information

NoSQL for SQL Professionals William McKnight

NoSQL for SQL Professionals William McKnight NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released

X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released General announcements In-Memory is available next month http://www.oracle.com/us/corporate/events/dbim/index.html X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released

More information

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA TRENDS AND TECHNOLOGIES BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Big Systems, Big Data

Big Systems, Big Data Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch ph-adp-ddm-lab@cern.ch XLDB

More information

NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015

NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015 NoSQL Databases Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015 Database Landscape Source: H. Lim, Y. Han, and S. Babu, How to Fit when No One Size Fits., in CIDR,

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &

More information

The basic data mining algorithms introduced may be enhanced in a number of ways.

The basic data mining algorithms introduced may be enhanced in a number of ways. DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,

More information

Oracle Database 12c Plug In. Switch On. Get SMART.

Oracle Database 12c Plug In. Switch On. Get SMART. Oracle Database 12c Plug In. Switch On. Get SMART. Duncan Harvey Head of Core Technology, Oracle EMEA March 2015 Safe Harbor Statement The following is intended to outline our general product direction.

More information

Google Bing Daytona Microsoft Research

Google Bing Daytona Microsoft Research Google Bing Daytona Microsoft Research Raise your hand Great, you can help answer questions ;-) Sit with these people during lunch... An increased number and variety of data sources that generate large

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Transforming the Telecoms Business using Big Data and Analytics

Transforming the Telecoms Business using Big Data and Analytics Transforming the Telecoms Business using Big Data and Analytics Event: ICT Forum for HR Professionals Venue: Meikles Hotel, Harare, Zimbabwe Date: 19 th 21 st August 2015 AFRALTI 1 Objectives Describe

More information

Big Data Technologies Compared June 2014

Big Data Technologies Compared June 2014 Big Data Technologies Compared June 2014 Agenda What is Big Data Big Data Technology Comparison Summary Other Big Data Technologies Questions 2 What is Big Data by Example The SKA Telescope is a new development

More information

A Review of Column-Oriented Datastores. By: Zach Pratt. Independent Study Dr. Maskarinec Spring 2011

A Review of Column-Oriented Datastores. By: Zach Pratt. Independent Study Dr. Maskarinec Spring 2011 A Review of Column-Oriented Datastores By: Zach Pratt Independent Study Dr. Maskarinec Spring 2011 Table of Contents 1 Introduction...1 2 Background...3 2.1 Basic Properties of an RDBMS...3 2.2 Example

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,

More information

Splice Machine: SQL-on-Hadoop Evaluation Guide www.splicemachine.com

Splice Machine: SQL-on-Hadoop Evaluation Guide www.splicemachine.com REPORT Splice Machine: SQL-on-Hadoop Evaluation Guide www.splicemachine.com The content of this evaluation guide, including the ideas and concepts contained within, are the property of Splice Machine,

More information

Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН

Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН Zettabytes Petabytes ABC Sharding A B C Id Fn Ln Addr 1 Fred Jones Liberty, NY 2 John Smith?????? 122+ NoSQL Database

More information

How to Enhance Traditional BI Architecture to Leverage Big Data

How to Enhance Traditional BI Architecture to Leverage Big Data B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...

More information

bigdata Managing Scale in Ontological Systems

bigdata Managing Scale in Ontological Systems Managing Scale in Ontological Systems 1 This presentation offers a brief look scale in ontological (semantic) systems, tradeoffs in expressivity and data scale, and both information and systems architectural

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

SQL Server 2012 Performance White Paper

SQL Server 2012 Performance White Paper Published: April 2012 Applies to: SQL Server 2012 Copyright The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication.

More information