DiNoDB: Efficient Large-Scale Raw Data Analytics
Yongchao Tian (Eurecom), Anastasia Ailamaki (EPFL), Ioannis Alagiannis (EPFL), Pietro Michiardi (Eurecom), Erietta Liarou (EPFL), Marko Vukolić (Eurecom)

ABSTRACT

Modern big data workflows, found in, e.g., machine learning use cases, often involve iterating between cycles of batch analytics and interactive analytics on temporary data. Whereas batch analytics solutions for large volumes of raw data are well established (e.g., Hadoop, MapReduce), state-of-the-art interactive analytics solutions (e.g., distributed shared-nothing RDBMSs) require a data loading and/or transformation phase, which is inherently expensive for temporary data. In this paper, we propose a novel scalable distributed solution for in-situ data analytics that offers both scalable batch and interactive data analytics on raw data, hence avoiding the loading-phase bottleneck of RDBMSs. Our system combines a MapReduce-based platform with the recently proposed NoDB paradigm, which optimizes traditional centralized RDBMSs for in-situ queries of raw files. We revisit NoDB's centralized design and scale it out to support multiple clients and data processing nodes, producing a new distributed data analytics system we call Distributed NoDB (DiNoDB). DiNoDB leverages MapReduce batch queries to produce critical pieces of metadata (e.g., distributed positional maps and vertical indices) that speed up interactive queries without the overheads of the data loading and data movement phases, allowing users to quickly and efficiently exploit their data. Our experimental analysis demonstrates that DiNoDB significantly reduces the data-to-query latency with respect to comparable state-of-the-art distributed query engines such as Shark, Hive and HadoopDB.

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems, Query processing

Keywords: Distributed database; in-situ query; positional map file

1. INTRODUCTION

In recent years, modern large-scale data analysis systems have flourished. For example, systems such as Hadoop and Spark [12, 4] focus on issues related to fault tolerance and expose a simple yet elegant parallel programming model that hides the complexities of synchronization. Moreover, the batch-oriented nature of such systems has been complemented by additional components (e.g., [5, 20]) that offer (near) real-time analytics on data streams. The combination of these two approaches is now commonly known as the Lambda Architecture (LA) [16]. The LA is split into three layers: a) the batch layer, which manages and pre-processes an append-only set of raw data; b) the serving layer, which indexes the batch views so that ad-hoc queries can be supported with low latency; and c) the speed layer, which handles all low-latency requirements using fast and incremental algorithms over recent data only.

Armed with new systems to store and process data, users' needs have recently grown in complexity as well. For instance, modern data analysis involves operations that go beyond extract-transform-load (ETL) or aggregation workloads, often employing statistical learning techniques to, e.g., build data models and cluster similar data. As an illustrative use case, we take the perspective of a user (e.g., a data scientist) focusing on a data clustering problem.
In such a scenario, our user typically faces the following issues: i) clustering algorithms (e.g., k-means [14], DBSCAN [11]) require parameter tuning and an appropriate distance function; and ii) assessing clustering quality typically requires a trial-and-error process whereby test data is assigned to clusters and only domain knowledge can be used to discern a good clustering from a bad one. In practice, even such a simple scenario illustrates a typical development workflow, which involves: a batch processing phase (e.g., running k-means); an interactive query phase on temporary data (i.e., on data that is interesting only for a relatively short period of time); and several iterations of both phases until the algorithms are properly tuned and the final results meet the user's expectations (see the sketch below).

Our system, called DiNoDB, explicitly targets such development workflows. Unlike current approaches, which generally require a long and costly data loading phase that considerably increases the data-to-query latency, DiNoDB allows querying raw data in-situ, and exposes a standard SQL interface to the user, making query analysis easier and effectively removing one of the main operational bottlenecks of data analysis.
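To make this workflow concrete, the following toy Python sketch mimics the loop just described on one-dimensional data. The in-memory kmeans function stands in for what would be a MapReduce k-means job in our setting, and the final print stands in for the interactive queries a data scientist would run on the temporary clustering output; all names and the dataset are purely illustrative.

```python
import random

def kmeans(points, k, iters=10):
    """Toy stand-in for the batch phase (a MapReduce k-means job in practice)."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Three latent clusters; the "right" k is unknown to the analyst.
points = [random.gauss(mu, 1.0) for mu in (0.0, 5.0, 9.0) for _ in range(100)]
for k in (2, 3, 4):                  # parameter-tuning iterations
    centers = kmeans(points, k)      # batch phase: produces temporary data
    # Interactive phase: inspect the temporary results (in DiNoDB, ad-hoc SQL
    # over the raw output files); here we merely print the fitted centers.
    print(k, sorted(round(c, 2) for c in centers))
```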
In particular, DiNoDB can be seen as a distributed instantiation of the NoDB paradigm [3], which is specifically designed with in-situ interactive queries in mind (we briefly revisit NoDB [3] in Section 2). The main principle of DiNoDB is to avoid the traditional loading phase on temporary data; indeed, the traditional data loading phase makes sense when the workload (data and queries) is stable and critical in the long term. However, since data loading may include creating indexes and paying serialization and parsing overheads to truly accelerate query processing, it is reasonable to question its validity when working on temporary data.

The key design idea behind DiNoDB is to shift part of the burden of a traditional load operation to the batch processing phase of a development workflow. While batch data processing takes place, DiNoDB piggybacks the creation of distributed positional maps and vertical index files; such auxiliary files (DiNoDB metadata) are subsequently used by a cluster of DiNoDB nodes to improve the performance of interactive user queries on the temporary data. Interactive queries operate directly on raw data files produced by the batch processing phase, which are stored, for example, on a distributed file system such as HDFS [15], or directly in memory. Our experimental evaluation indicates that, for such workloads, DiNoDB systematically outperforms current solutions, including HadoopDB [1], Hive [13] and Shark [18].

In summary, our main contributions in this paper include:

- The design of the DiNoDB architecture, which is co-located with the Hadoop MapReduce framework. DiNoDB is a distributed instantiation of the NoDB paradigm, and provides efficient, in-situ SQL-based querying capabilities;
- A system performance evaluation and comparative analysis of DiNoDB versus state-of-the-art systems including HadoopDB, Hive and Shark.

The rest of the paper is organized as follows. In Section 2, we give a short introduction to NoDB. The DiNoDB architecture and its components are covered in detail in Section 3. Section 4 presents our experimental evaluation, demonstrating the performance of DiNoDB. We briefly overview related work in Section 5 and conclude in Section 6.

2. BACKGROUND ON NODB

NoDB [3] is a database system design paradigm that proposes a conceptual approach for building database systems tailored to querying raw files (i.e., in-situ). As a centralized instantiation of the paradigm, [3] presents a variant of PostgreSQL called PostgresRaw. The main advantage of PostgresRaw, compared to traditional DBMSs, is that it avoids the data loading phase and proceeds directly to the querying phase once data becomes available. In a traditional DBMS, loading the whole dataset into the database is unavoidable, even if only a few attributes of a table are needed; the loading process, however, might take hours (or more) for large amounts of data. NoDB, on the other hand, adopts in-situ querying instead of loading and preparing the data for queries.

To speed up query execution on raw data files, PostgresRaw works in two directions: it minimizes the cost of raw data accesses and it reduces the number of raw data accesses. To minimize raw data access cost, PostgresRaw tokenizes only the necessary attributes and parses only qualified tuples. PostgresRaw builds an auxiliary structure called a positional map, which contains the relative positions of attributes in a line, and updates it during query execution.

[Figure 1: DiNoDB architecture. A DiNoDB client pushes queries to DiNoDB nodes and aggregates the results; the MetaConnector fetches file block location information from the HDFS NameNode; DiNoDB nodes are co-located with HDFS DataNodes.]
With positional maps, attributes that have already been processed can later be fetched directly, without scanning the entire data record. If a requested attribute is missing from the positional map, the position of a nearby attribute can be used to navigate faster to the requested one. Furthermore, to reduce raw data accesses, PostgresRaw also maintains a data cache that temporarily holds previously accessed data. The most expensive query for PostgresRaw is the first one, for which PostgresRaw needs to scan the whole raw file, since no auxiliary positional map or cache exists yet to speed it up. However, even the first PostgresRaw query has lower latency than the loading phase of a traditional DBMS, which involves scanning the entire file [3]. As the positional map and cache grow over several queries, query latency in PostgresRaw becomes comparable to, and sometimes even better than, that of traditional DBMSs. Overall, PostgresRaw often outperforms classical DBMSs in the latency of raw data queries.
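To make the positional map and anchor-point navigation described above concrete, here is a minimal Python sketch of the idea on a CSV-like file. The map stores byte offsets for a subset of attributes; a lookup for an unmapped attribute seeks to the nearest mapped attribute at or before it and scans forward. The function names, the in-memory dictionary layout and the assumption that attribute 0 is always mapped are ours, not PostgresRaw's actual implementation.

```python
def build_positional_map(path, sampled_attrs, delim=b","):
    """One sequential scan over the raw file, recording absolute byte offsets
    of the sampled attributes: {row: {attr_index: byte_offset}}."""
    pmap, line_start = {}, 0
    with open(path, "rb") as f:
        for row, line in enumerate(f):
            offsets, pos = {}, 0
            for col, field in enumerate(line.rstrip(b"\n").split(delim)):
                if col in sampled_attrs:
                    offsets[col] = line_start + pos
                pos += len(field) + 1          # +1 for the delimiter
            pmap[row] = offsets
            line_start += len(line)
    return pmap

def read_attribute(path, pmap, row, attr, delim=b","):
    """Fetch one attribute: seek to the nearest mapped attribute at or before
    `attr` (an "anchor point") and tokenize forward from there only."""
    offsets = pmap[row]
    anchor = max(a for a in offsets if a <= attr)   # assumes attr 0 is mapped
    with open(path, "rb") as f:
        f.seek(offsets[anchor])
        fields = f.readline().rstrip(b"\n").split(delim)
        return fields[attr - anchor]

# Usage sketch: map attributes 0, 5 and 10; reading attribute 7 then costs a
# seek to attribute 5 plus a short forward scan instead of a full-line parse.
# pmap = build_positional_map("raw.csv", sampled_attrs={0, 5, 10})
# value = read_attribute("raw.csv", pmap, row=3, attr=7)
```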
3. DISTRIBUTED NODB (DiNoDB)

In this section, we present Distributed NoDB (DiNoDB) in detail. DiNoDB is designed for interactive query analytics on large volumes of raw, unloaded data, and integrates with batch processing frameworks such as MapReduce/Hadoop.

3.1 High-level design

At a high level (see Figure 1), DiNoDB consists of a set of PostgresRaw instances (used for interactive queries) combined with the Hadoop framework (used for batch processing). Our DiNoDB prototype is inspired by HadoopDB [1]: it orchestrates PostgresRaw instances using the Hadoop framework itself, relying on existing connectors that plug into PostgreSQL. As a result of this high-level design, raw data that is to be queried in DiNoDB resides in the Hadoop Distributed File System (HDFS) [15]. DiNoDB ensures data locality by co-locating DiNoDB nodes (described in Section 3.4) with HDFS DataNodes, as shown in Figure 1. Next, we describe in more detail the DiNoDB pre-processing (batch processing) phase, as well as the DiNoDB components, namely DiNoDB clients and DiNoDB nodes.

3.2 Pre-processing (batch processing) phase

The batch processing phase in a development workflow typically involves the execution of (sophisticated) analytics algorithms. This phase is accomplished with one or more MapReduce jobs, whereby output data is written to HDFS by reducers in a key-value format. DiNoDB leverages the batch processing phase as a pre-processing phase for future interactive queries. Namely, DiNoDB piggybacks the generation of auxiliary metadata, including positional maps and a vertical index, onto the batch processing phase. Complementary to positional maps, vertical indices in DiNoDB allow users to designate one attribute as a key attribute. For each record, an entry in the vertical index has two fields: the key attribute value and the record's row offset. More details about vertical indices can be found in Section 3.4. DiNoDB metadata is created with user-defined functions (UDFs) executed by reducers. This guarantees the co-location of raw data files and their associated DiNoDB metadata, since reducers always first write their output on the local machine on which they execute.

In addition to metadata generation, DiNoDB capitalizes on data pre-processing to store output data in memory; this can be done implicitly or explicitly. In the first case, raw output data is stored in the file system cache of DiNoDB nodes (i.e., HDFS DataNodes), which uses an approximate form of an LRU eviction policy: recently written output data resides in memory. Alternatively, output data can be explicitly stored in RAM, using the ramfs file system as an additional mount point for HDFS.¹ With explicit in-memory data caching, DiNoDB avoids the drawbacks of the implicit file system caching scheme in multi-tenant environments, where multi-tenancy might cause cache evictions. As we demonstrate in Section 4, the combination of metadata and in-memory data storage substantially contributes to DiNoDB's query execution performance.

¹ This technique has been independently considered for inclusion in a recent patch to HDFS [8].

3.3 DiNoDB clients

A DiNoDB client serves as the entry point for DiNoDB interactive queries: it provides a standard shell command interface, hiding the network layout and the distributed system architecture from users. As such, applications can use DiNoDB just like a traditional DBMS. DiNoDB clients accept application requests and communicate with DiNoDB nodes. When a DiNoDB client receives a query, it fetches the related metadata for the tables (raw files) indicated in the query, using the MetaConnector module. The MetaConnector (see Figure 1) is a proxy between DiNoDB and the HDFS NameNode, and is responsible for retrieving HDFS metadata such as partitions and block locations of raw data files. Using this metadata, the MetaConnector guides DiNoDB clients to query the DiNoDB nodes that hold the raw data files relevant to user queries. Additionally, the MetaConnector remotely configures DiNoDB nodes so that they can build the mapping between tables and the related blocks, including all data file blocks, positional map blocks and vertical index blocks.

In summary, the anatomy of a query execution is as follows: (i) using the MetaConnector, the DiNoDB client learns the location of all raw file blocks and pushes the query to the respective DiNoDB nodes; (ii) the DiNoDB nodes process the query in parallel; and finally, (iii) the DiNoDB client aggregates the results.
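The Python sketch below illustrates this three-step anatomy, including the failover-to-replica behavior described right after the sketch. The MetaConnector and node interfaces used here (locate, replica_for, query) are hypothetical stand-ins for DiNoDB's internal APIs, not its real interfaces.

```python
from concurrent.futures import ThreadPoolExecutor

def execute_query(sql, table, meta, nodes, timeout=30.0):
    """(i) Resolve block locations, (ii) push the query to each DiNoDB node in
    parallel, (iii) aggregate partial results. On timeout or failure, retry the
    same query on a node holding a replica of the target blocks."""
    placement = meta.locate(table)            # {node_id: [block_ids]} (assumed)
    with ThreadPoolExecutor(len(placement)) as pool:
        futures = {pool.submit(nodes[n].query, sql, blks): (n, blks)
                   for n, blks in placement.items()}
        rows = []
        for fut, (n, blks) in futures.items():
            try:
                rows.extend(fut.result(timeout=timeout))
            except Exception:                 # node failure or timeout
                replica = meta.replica_for(n, blks)   # assumed failover API
                rows.extend(nodes[replica].query(sql, blks))
    return rows
```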
[Figure 2: DiNoDB node. (a) DiNoDB node internals: HDFS data blocks, data cache, positional map blocks and a vertical index on an HDFS DataNode; (b) DiNoDB vertical index: one (key attribute value, row offset) entry per row.]

Note that, since DiNoDB nodes are co-located with the HDFS DataNodes, DiNoDB inherits fault tolerance from HDFS replication. If a DiNoDB client detects the failure of a DiNoDB node, or upon the expiration of a timeout on a DiNoDB node's responses, the client issues the same query to another DiNoDB node holding a replica of the target HDFS blocks.

3.4 DiNoDB nodes

DiNoDB nodes instantiate customized PostgresRaw databases, which execute user queries, and are co-located with HDFS DataNodes (see Figure 2(a)). DiNoDB nodes leverage the positional map files generated in the DiNoDB pre-processing phase (see Section 3.2) when executing queries. DiNoDB strives to keep metadata small: indeed, in some cases a positional map may contain the position of every attribute in a raw data file, which could make the metadata as large as the data to be queried, thus nullifying its benefit. To avoid this, DiNoDB uses a sampling technique and stores the positions of only a few attributes, depending on a user-provided sampling rate. From the query latency perspective, an approximate positional map still provides tangible benefits: when a query hits an attribute in the sampled map, the benefit is direct; otherwise, DiNoDB nodes use the sampled attributes as anchor points, from which they proceed with a sequential scan of nearby attributes to satisfy the query.

When the nature of the interactive queries is known in advance, DiNoDB users can additionally specify a single key attribute, which is used to create a vertical index file (see Figure 2(b)). Vertical index generation (just like that of positional maps) is piggybacked onto the batch processing phase. When available, DiNoDB nodes load the vertical index files, which contain, for each record, the key attribute value and its row offset in the raw data file. This technique contributes to improved system performance: if a user query hits the key attribute, DiNoDB nodes perform an index scan instead of a full sequential scan of the raw data, which considerably reduces disk reads (see the sketch below).

Note that both positional map and vertical index files are loaded by a DiNoDB node during the first query execution. As our performance evaluation shows (see Section 4), the metadata load time is very small compared to the execution time of the first query; as such, only the first query represents a bottleneck for DiNoDB (a cost that the NoDB paradigm also has to bear).
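A minimal sketch of the vertical index idea follows: while a (simulated) batch phase writes records, it also emits one (key value, row offset) entry per record, mimicking the reducer-side UDF of the pre-processing phase; at query time, a range query on the key attribute seeks directly to the matching rows instead of scanning the file. Sorting the index so that a binary search applies is our assumption; the paper does not specify the on-disk index format.

```python
import bisect

def write_with_vertical_index(records, data_path, index_path, key_attr=0):
    """Write CSV records and piggyback a (key, offset) vertical index."""
    entries, offset = [], 0
    with open(data_path, "wb") as out:
        for rec in records:
            line = (",".join(map(str, rec)) + "\n").encode()
            entries.append((rec[key_attr], offset))
            out.write(line)
            offset += len(line)
    entries.sort()                               # sorted by key (our assumption)
    with open(index_path, "w") as idx:
        idx.writelines(f"{k},{o}\n" for k, o in entries)

def key_range_scan(data_path, index_path, lo, hi):
    """Index scan: binary-search the index, then seek to each matching row."""
    with open(index_path) as idx:
        index = [tuple(map(int, line.split(","))) for line in idx]
    rows = []
    with open(data_path, "rb") as data:
        for key, off in index[bisect.bisect_left(index, (lo,)):]:
            if key > hi:
                break
            data.seek(off)                       # jump straight to the record
            rows.append(data.readline().decode().rstrip("\n"))
    return rows

# Usage sketch (integer keys assumed):
# write_with_vertical_index([(7, 42), (3, 99), (5, 1)], "t.csv", "t.vidx")
# key_range_scan("t.csv", "t.vidx", 4, 7)   # -> rows with keys 5 and 7
```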
Optimizations in DiNoDB nodes. In the vanilla PostgresRaw [3] implementation, a table maps to a single raw data file. Since HDFS files are instead split into multiple blocks, DiNoDB nodes use a customized PostgresRaw instance, which features a new file reader that can map a table to a list of raw data file blocks. In addition, the vanilla PostgresRaw implementation is a multi-process server, which forks a new process for each new client session, with individual metadata and data cache per process. Instead, DiNoDB nodes place metadata and the data cache in shared memory, such that user queries sent through the DiNoDB client can benefit from them across multiple sessions.

An additional feature of DiNoDB nodes is that they access raw data files bypassing the HDFS client API. This design choice provides increased performance by avoiding the overheads of a Java-based interface. More importantly, as discussed in Section 3.2, DiNoDB users can selectively indicate whether raw data files are placed on disk or in memory. Hence, DiNoDB nodes can seamlessly benefit from the native file system cache, or from a memory-backed file system, to dramatically decrease query times.

4. EVALUATION

Next, we present an experimental analysis of DiNoDB. First, in Section 4.1 we study the performance of DiNoDB in the context of interactive queries and compare DiNoDB to Shark [18] (version 0.8.0), Hive [13] (version 0.9.0) and PostgreSQL-backed HadoopDB [1]. Then, in Section 4.2 we evaluate the benefits of pre-generating DiNoDB metadata (i.e., positional maps and vertical indices), as well as the benefits of the in-memory cache that is populated in the pre-processing phase (see Section 3.2). Finally, in Section 4.3, we compare the performance of DiNoDB to that of the other systems using a dataset that does not fit in memory. All experiments are conducted on a cluster of 6 machines, each with 4 cores, 16 GB of RAM and a 1 Gbps network interface. The underlying distributed file system is HDFS, configured with one NameNode and five DataNodes.

4.1 Comparative performance evaluation

In this section, we proceed with a comparative analysis using two kinds of query templates: random queries and key attribute queries. The raw data file we use in our experiments contains 3.6 × 10^7 tuples. Each tuple has 15 attributes, which are all integers distributed randomly in the range [0, 10^9). The size of the raw file is 5 GB. The dataset is stored in RAM, using the ramfs file system available on each machine. Additional details are as follows. DiNoDB uses a 2.2 GB positional map (generated with a 1/1 sampling rate) and a 72 MB vertical index file. HadoopDB uses PostgreSQL as a local database, co-located with HDFS DataNodes. To minimize loading time in HadoopDB, we load bulk data into PostgreSQL directly, without using HadoopDB's Data Loader, which requires repartitioning the data. Finally, we evaluate Shark both by loading the dataset into its internal caching mechanism before issuing queries (a configuration we label Shark_cached), and by using the default, disk-based mechanism (which we simply call Shark).
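For concreteness, the two query templates can be pictured as in the sketch below. The table and attribute names, and the exact shape of the predicates, are our assumptions; the setup above only fixes the attribute domain and the selectivity.

```python
import random

DOMAIN = 10**9        # attribute values are integers in [0, 10^9)
NUM_ATTRS = 15

def random_attr_query(selectivity=0.01):
    """Template 1: project/filter one randomly chosen attribute."""
    a = random.randrange(NUM_ATTRS)
    width = int(DOMAIN * selectivity)
    lo = random.randrange(DOMAIN - width)
    return (f"SELECT attr{a} FROM raw_table "
            f"WHERE attr{a} BETWEEN {lo} AND {lo + width};")

def key_attr_query(selectivity=0.01):
    """Template 2: range predicate on the designated key attribute,
    which DiNoDB can answer with a vertical-index scan."""
    width = int(DOMAIN * selectivity)
    lo = random.randrange(DOMAIN - width)
    return f"SELECT * FROM raw_table WHERE key_attr BETWEEN {lo} AND {lo + width};"
```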
Random queries. A sequence of five SELECT SQL queries is executed on all systems. Each query targets one random attribute with 1% selectivity. The results are shown in Figure 3.

[Figure 3: DiNoDB vs. other distributed systems, random queries: execution time (sec) for Hive, HadoopDB, Shark_cached, Shark and DiNoDB.]

Hive has the worst performance, since each query repeatedly scans the whole data files; it spends more than 200 seconds per query. HadoopDB has very short query latencies, but only after 200 seconds of loading time. Shark_cached enjoys faster loading than HadoopDB because it loads data into memory with maximum parallelism, at the aggregate throughput of the CPUs [18]. HadoopDB could also support a high degree of parallelism when loading the data, but that would require repartitioning the data on each DataNode, which would require extra effort. On the other hand, when queries are executed in Shark in-situ, i.e., without caching/loading, each query's latency is about 43 seconds. Finally, DiNoDB achieves the best performance, with an average query latency of 24 seconds. This is explained by the fact that DiNoDB explicitly avoids the loading phase and leverages positional maps to speed up queries on raw files.

Key attribute based queries. In this experiment, we measure the effect of vertical indices on the overall query latency. To this end, we execute range queries on the key attribute. The results are shown in Figure 4.

[Figure 4: DiNoDB vs. other distributed systems, key attribute queries: execution time (sec), including index-build and load components, for Hive, HadoopDB, Shark_cached, Shark and DiNoDB.]

Hive supports index scans, and needs more than 400 seconds to build the index; however, we register a query latency that is worse than that of Hive with a sequential scan (depicted in Figure 4). In HadoopDB, building an index on our dataset requires about 19 seconds; overall, HadoopDB requires 223 seconds before any query can be executed, although each query then has less than 10 seconds of latency. Shark does not have any indexing mechanism, so the performance we register is similar to that of our experiments with random queries. DiNoDB achieves the best performance: the average query latency is about 11 seconds, which is less than half of the latency for random queries.

Discussion. In our current prototype, a DiNoDB node uses only a single process with one thread. In contrast, Shark can use all available CPUs for loading and processing data.² Hence, Shark_cached has low per-query latency (after loading), which makes it appealing when there are many queries (e.g., when the data of interest is not temporary).
Moreover, since DiNoDB is coupled with the Hadoop framework, it suffers from Hadoop's job scheduling overhead, which is about 7 seconds per query. In our future work, we aim at eliminating these sources of overhead; globally, however, DiNoDB outperforms all other systems for the development workloads we target in this work.

² A fairer comparison between DiNoDB and Shark would involve restricting Shark to a single core; such a single-core Shark in our experience suffers a threefold (3x) latency increase compared to the performance reported in Figures 3 and 4.

4.2 Benefits of pre-processing

In this experiment, we evaluate the benefits of the DiNoDB pre-processing phase (Section 3.2), in which we generate the DiNoDB metadata. Indeed, one major difference between a vanilla PostgresRaw instance and DiNoDB is that the latter loads positional map and vertical index files before executing the first query. As such, what we really measure is the overhead of this metadata loading phase. Note that in the following we use a 1/1 sampling rate to generate the DiNoDB positional map. We run the next experiments in a single-machine deployment: thus, we compare the performance of a vanilla PostgresRaw instance to that of a single DiNoDB node. In both cases, the two systems execute the same sequence of queries on a 10 GB raw file. All queries are SQL SELECT statements, which have a selection condition in the WHERE clause and randomly project one attribute. Additionally, we also study the impact of the raw data being in memory or on disk. We do this using vanilla PostgresRaw, which allows us to isolate the effects of metadata from the benefits of a faster storage tier. We denote the two cases by PostgresRaw_ramfs and PostgresRaw_disk, respectively.

[Figure 5: Benefits of pre-processing: per-query execution time (sec) for DiNoDB, PostgresRaw_ramfs and PostgresRaw_disk.]

Figure 5 shows our results. We can see that for PostgresRaw_disk the first query has a latency of about 143 seconds: in this case, the disk is the bottleneck. The first query in PostgresRaw_ramfs takes about 53 seconds. A DiNoDB node, instead, executes the first query in 13 seconds, of which 0.5 seconds are spent loading the metadata files. For all subsequent queries, all three systems have similar performance, because the data resides in memory. Note that more queries imply richer positional maps, and hence lower latencies.

In summary, the experiments shown in this section validate the design choices in DiNoDB: if possible, raw data should reside in RAM, to avoid paying the price of slow disk access. Additionally, piggybacking the generation of metadata files onto the pre-processing phase, and loading them in memory before query execution, only marginally affects the overall query latency.

4.3 Large dataset test

We now study the performance of the different systems when the dataset does not fit in memory. We use a 25 GB raw data file generated similarly to the raw file of Section 4.1 (albeit with more tuples), and the same random queries defined in Section 4.1. The results are shown in Figure 6, where we omit latencies for Shark_cached, for obvious reasons. HadoopDB achieves the smallest query latencies, at the cost of a 5-minute loading phase. Hive, Shark and DiNoDB have similar performance because they are all bound by disk I/O. Overall, DiNoDB has the smallest aggregate query latency.

[Figure 6: 25 GB dataset experiment: execution time (sec), including loading, for Hive, HadoopDB, Shark and DiNoDB.]
Clearly, DiNoDB faces a tipping point: when the number of queries on the temporary data is small enough, aggregate query latencies are in its favor; the situation changes when the number of queries exceeds a threshold. A better characterization of this threshold is part of our future work.

5. RELATED WORK

Several research works and commercial products complement the batch processing nature of Hadoop/MapReduce [12, 7] with systems to query large-scale data at interactive speed using a SQL-like interface. Examples of such systems include HadoopDB [1], Vertica [17], Hive [13] and Impala [6]. These systems require data to be loaded before queries can be executed: in workloads for which data-to-query time matters, for example due to the ephemeral nature of the data at
hand, the overheads due to the load phase crucially impact query performance. In [2], the authors propose the concept of invisible loading for HadoopDB as a technique to reduce the data-to-query time; with invisible loading, the loading into the underlying DBMS happens progressively. In contrast to such systems, DiNoDB avoids data loading altogether and is tailored for querying raw data files, leveraging positional maps and vertical indices.

Shark [18] presents an alternative design: it relies on a novel distributed shared memory abstraction [19] to perform most computations in memory while offering fine-grained fault tolerance. Shark builds on Hive [13] to translate SQL-like queries into execution plans running on the Spark system [4], hence marrying batch and interactive data analysis. To achieve short query times, Shark requires data to be resident in memory: as such, the output of batch data processing, which is generally materialized to disk for fault tolerance, needs to be loaded in RAM first. As our experimental results indicate, such data caching can be a costly operation.

PostgresRaw [3] is a centralized DBMS that avoids data loading and transformation prior to queries. As we detailed in the previous sections, DiNoDB leverages PostgresRaw as a building block, effectively distributing and scaling out PostgresRaw while integrating it with the Hadoop batch processing framework. Unlike PostgresRaw, DiNoDB does not generate positional maps on the fly, but leverages the Hadoop batch processing phase to pre-generate positional maps and vertical indices that speed up query execution.

Finally, DiNoDB shares some similarities with several research works that focus on improving Hadoop performance. For example, Hadoop++ [9] modifies the data format to include a Trojan Index, so that it can avoid full sequential file scans. Furthermore, CoHadoop [10] co-locates related data files on the same set of nodes, so that a future join task can be executed locally, without transferring data over the network. However, neither Hadoop++ nor CoHadoop is specifically tuned for interactive raw data analytics, as DiNoDB is.

6. CONCLUSION

We proposed the design of DiNoDB, a distributed database system tuned for interactive queries on raw data files generated by large-scale batch processing frameworks such as Hadoop/MapReduce. Our preliminary evaluation suggests that DiNoDB has very promising performance for iterative machine learning data processing, where a few interactive queries are performed in between iterative batch processing jobs. In such use cases, we showed that DiNoDB outperforms state-of-the-art systems, including Apache Hive, HadoopDB and Shark. In future work, we aim at further improving the performance of DiNoDB, notably by removing the overhead related to the Hadoop orchestration framework we use (currently HadoopDB), as well as evaluating our system on a wider set of datasets and workloads.

Acknowledgements. This work is partially supported by the EU project BigFoot (FP7-ICT-22385).

REFERENCES

[1] A. Abouzeid et al. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., 2009.
[2] A. Abouzied et al. Invisible loading: Access-driven data transfer from raw files into database systems. In Proceedings of the 16th International Conference on Extending Database Technology, EDBT '13, 2013.
[3] I. Alagiannis et al. NoDB: Efficient query execution on raw data files. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, 2012.
[4] Apache Spark. Webpage.
[5] Apache Storm. Webpage.
[6] Cloudera Impala. Webpage. products-and-services/cdh/impala.html/.
[7] J. Dean et al. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI '04, 2004.
[8] Discardable Distributed Memory: Supporting Memory Storage in HDFS. Webpage.
[9] J. Dittrich et al. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proc. VLDB Endow., 2010.
[10] M. Y. Eltabakh et al. CoHadoop: Flexible data placement and its exploitation in Hadoop. Proc. VLDB Endow., 2011.
[11] M. Ester et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996.
[12] Hadoop. Webpage.
[13] Hive. Webpage.
[14] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[15] K. Shvachko et al. The Hadoop distributed file system. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST '10, 2010.
[16] The Lambda Architecture. Webpage.
[17] Vertica. Webpage.
[18] R. S. Xin et al. Shark: SQL and rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, 2013.
[19] M. Zaharia et al. Spark: Cluster computing with working sets. In HotCloud, 2010.
[20] M. Zaharia et al. Discretized streams: Fault-tolerant streaming computation at scale. In SOSP, 2013.
