High Performance Spatial Queries for Spatial Big Data: from Medical Imaging to GIS
Fusheng Wang (1), Ablimit Aji (2), Hoang Vo (3)

(1) Department of Biomedical Informatics and Department of Computer Science, Stony Brook University, USA
(2) HP Labs, USA
(3) Department of Mathematics and Computer Science, Emory University, USA

Abstract

Support of high performance queries on large volumes of spatial data has become increasingly important in many application domains, including geospatial problems in numerous disciplines, location based services, and emerging medical imaging applications. There are two major challenges for managing massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries. Our goal is to develop a general framework to support high performance spatial queries and analytics for spatial big data on MapReduce and CPU-GPU hybrid platforms. In this paper, we introduce Hadoop-GIS, a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through skew-aware spatial partitioning, on-demand indexing, the customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling of boundary objects. To accelerate compute-intensive geometric operations, GPU based geometric computation algorithms are integrated into MapReduce pipelines. Our experiments have demonstrated that Hadoop-GIS is highly efficient and scalable, and outperforms parallel spatial DBMSs for compute-intensive spatial queries.

1 Introduction

The proliferation of cost effective and ubiquitous positioning technologies has enabled the capture of spatially oriented data at an unprecedented scale and rate. Volunteered Geographic Information (VGI) such as OpenStreetMap [1] further accelerates the generation of massive spatial information from community users. Analyzing large amounts of spatial data to derive value and guide decision making has become essential to business success and scientific discovery. The rapid growth of spatial data has been driven not only by industrial applications, but also by emerging scientific applications that are increasingly data- and compute-intensive. With the rapid improvement of data acquisition technologies, it has become more efficient to capture extremely large volumes of spatial data to support scientific research. For example, digital pathology imaging has become an emerging field in the past decade, where examination of high resolution scanned images of tissue specimens enables novel and more effective methods for disease diagnosis and therapy. Pathology image analysis offers a means of rapidly carrying out quantitative, reproducible measurements of micro-anatomical features in high-resolution images. Regions of micro-anatomic objects such as nuclei and cells are computed through image segmentation algorithms and represented by their boundaries, and image features are extracted from these objects.
Exploring the results of such analyses involves complex queries such as spatial cross-matching or overlay, spatial proximity computations between objects, and queries for global spatial pattern discovery. These queries often involve billions of spatial objects and extensive geometric computations. For example, spatial cross-matching is often used to compare and evaluate image segmentation algorithm results [14] (Figure 1). In particular, the spatial cross-matching/overlay problem involves identifying and comparing objects belonging to a wide range of different observations.

Figure 1: Example spatial query cases. (a) pathology imaging; (b) GIS applications.

A major requirement for data intensive spatial applications is fast query response, which requires a scalable architecture that can query spatial data at a large scale. Another requirement is to support queries on cost effective architectures such as commodity clusters or cloud environments. With the rapid improvement of instrument resolutions, the increased accuracy of data analysis methods, and the massive scale of observed data, complex spatial queries have become increasingly data- and compute-intensive. A typical whole slide pathology image contains more than 100 billion pixels, millions of objects, and 100 million derived image features. A single study may involve thousands of images analyzed with dozens of algorithms, with varying parameters, to generate many different result sets to be compared and consolidated, at the scale of tens of terabytes. In addition, the aforementioned VGI also enables fast and massive geospatial data collection.

Besides the data scale challenge, most spatial queries involve geometric computations that are frequently compute-intensive. While spatial filtering using minimum bounding boxes (MBBs) can be accelerated through spatial access methods, spatial refinements such as polygon intersection verification are highly expensive operations. For instance, spatial join queries such as spatial cross-matching or spatial overlay can require significant numbers of CPU operations, mainly due to the polynomial complexity of many geometric computation methods. Such compute-intensive geometric computation, combined with the big data challenge, poses significant challenges to efficient spatial applications. There is major demand for viable spatial big data solutions from diverse fields.

Traditional spatial database management systems (SDBMSs) have been used for managing and querying spatial data, through extended spatial capabilities on top of object-relational database management systems. These systems often have major limitations on querying spatial data at massive scale, although parallel RDBMS architectures [9] are available. Parallel SDBMSs tend to reduce the I/O bottleneck through data partitioning, but are not optimized for compute-intensive operations such as geometric computations. Furthermore, parallel SDBMS architectures often lack an effective spatial partitioning mechanism to balance data and task loads across partitions. The high data loading overhead is another major bottleneck for SDBMS based solutions [10, 13, 14]. In contrast, the MapReduce computing model provides a highly scalable, reliable, elastic and cost effective framework for storing and processing massive data on a cluster or in a cloud environment. While the MapReduce model fits nicely with large scale problems through its key-based partitioning, spatial queries and analytics are intrinsically complex and difficult to adapt to this model due to the multi-dimensional nature of spatial data.

Spatial partitioning poses two major problems to be handled: the spatial data skew problem and the boundary object problem. The first can lead to load imbalance of tasks in a distributed system and thus long query response times, and the second can lead to incorrect query results if not handled properly. Furthermore, spatial query methods have to be adapted so that they can be mapped onto a partition based query processing framework while preserving correct query semantics. Spatial queries are also intrinsically complex, and often rely on effective access methods to reduce the search space and alleviate the high cost of geometric computations. Meanwhile, hybrid systems combining CPUs and GPUs are becoming commonly available in commodity clusters, but the computational capacity of such systems is often underutilized. There is a general trend towards simplified programming models such as MapReduce and hybrid computing architectures for processing massive data, but there is a significant research gap in developing new spatial querying and analytical methods to run on such architectures.

We have developed Hadoop-GIS [2, 3, 4, 5, 6, 12], a spatial data warehousing system over MapReduce, to support highly scalable and efficient spatial queries and analytics on large scale data. Hadoop-GIS provides a framework to parallelize multiple types of spatial queries and convert them into MapReduce based query pipelines. Specifically, Hadoop-GIS offers data skew aware spatial data partitioning to achieve task parallelization, an indexing-driven spatial query engine to process spatial queries, implicit query parallelization through MapReduce, and boundary object handling to generate accurate results. In addition, we have developed GPU based spatial operators to accelerate heavy duty geometric computations, and integrated them into MapReduce based query pipelines.

2 Overview

The main objective of Hadoop-GIS is to provide a highly scalable, cost-effective and efficient spatial query processing system for data- and compute-intensive spatial applications that can take advantage of MapReduce running on commodity clusters and CPU-GPU hybrid platforms. We first create new spatial data processing methods and pipelines with spatial partition level parallelism through the MapReduce programming model, and develop multi-level indexing methods to accelerate spatial data processing. Two critical components enable such partition based parallelism: effective and scalable spatial partitioning in MapReduce (pre-processing), and query normalization methods. To maximize execution performance, we exploit both thread-level and data-level parallelism, utilizing SIMD (Single Instruction Multiple Data) vector units to parallelize spatial operations at the object level (by grouping many objects) and at the intra-object level (by breaking an object into many smaller components), and integrate these operations into MapReduce pipelines.

In a MapReduce environment, we propose the following steps for running a typical spatial query, as shown in Algorithm 1. In step A, we partition the input space to generate tiles. In step B, we assign tile UIDs to spatial objects and store the objects in the Hadoop Distributed File System (HDFS). In step C, we pre-process the query and perform preliminary filtering based on the global region index derived from the data partitioning in step A. In step D, we perform tile based spatial query processing in which tiles are processed as independent MapReduce tasks in parallel. In step E, we process the boundary objects to remove duplicates and normalize the query result. In step F, we perform any post-processing required for certain spatial query types. In step G, we perform aggregations and any additional operations, and output the results to HDFS.
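To make steps A and B concrete, the following is a minimal mapper sketch in Java against Hadoop's org.apache.hadoop.mapreduce API. The fixed uniform grid, the tile edge length, and the tab separated input layout (object UID, MBB, payload) are assumptions for illustration, not the Hadoop-GIS data format:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /* Assumed input line: "objectUid<TAB>minX,minY,maxX,maxY<TAB>payload".
       Objects whose MBBs cross tile boundaries are emitted once per
       intersecting tile (multiple assignment, see Section 2.3). */
    public class TileAssignMapper extends Mapper<Object, Text, Text, Text> {
      private static final double TILE = 1.0; // tile edge length (assumed)

      @Override
      protected void map(Object key, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] f = line.toString().split("\t", 3);
        String[] m = f[1].split(",");
        int x0 = (int) Math.floor(Double.parseDouble(m[0]) / TILE);
        int y0 = (int) Math.floor(Double.parseDouble(m[1]) / TILE);
        int x1 = (int) Math.floor(Double.parseDouble(m[2]) / TILE);
        int y1 = (int) Math.floor(Double.parseDouble(m[3]) / TILE);
        for (int x = x0; x <= x1; x++)
          for (int y = y0; y <= y1; y++)
            ctx.write(new Text(x + "_" + y), line); // tile UID as the key
      }
    }

Emitting one copy of a record per intersecting tile implements exactly the multiple assignment strategy discussed in Section 2.3.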
2.1 Real-time Spatial Query Engine

A fundamental component of Hadoop-GIS is its standalone spatial query engine. Porting a spatial database engine for this purpose is not feasible, due to its tight integration with the RDBMS engine and the complexity of setup and optimization. We therefore developed the Real-time Spatial Query Engine (RESQUE) to support spatial query processing. RESQUE takes advantage of global tile indexes and local on-demand indexes to support efficient spatial queries. In addition, RESQUE is fully optimized, supports data compression, and incurs very low overhead on data loading. As a result, RESQUE is a highly efficient spatial query engine compared to a traditional SDBMS engine.
Algorithm 1: Typical workflow of spatial query processing on MapReduce

A. Data/space partitioning;
B. Storage of partitioned data on HDFS;
C. Pre-query processing (optional);
D. for each tile in the input collection do:
     index building for objects in the tile;
     tile based spatial query processing;
E. Boundary object handling;
F. Post-query processing (optional);
G. Data aggregation;
H. Result storage on HDFS.

RESQUE is compiled as a shared library which can be easily deployed in a cluster environment. Hadoop-GIS takes advantage of spatial access methods for query processing in two ways. At the higher level, Hadoop-GIS creates global region based spatial indexes of partitioned tiles for HDFS file split filtering. Consequently, for many spatial queries such as containment queries, the system can efficiently filter out most irrelevant tiles through this global region index. The global region index is small, and can be stored in HDFS and shared across cluster nodes through the Hadoop distributed cache mechanism. At the tile level, RESQUE supports an indexing on demand approach: it builds tile based spatial indexes on the fly, mainly for query processing purposes, and keeps the index files in main memory. Since tiles are relatively small, index building on a single tile is fast and significantly improves spatial query processing performance. Our experiments show that index building consumes a very small fraction of the overall query processing cost, and is negligible for compute- and data-intensive queries such as cross-matching.

2.2 MapReduce Based Parallel Query Execution

Instead of using explicit spatial query parallelization as summarized in [7], we take an implicit parallelization approach by leveraging MapReduce. This greatly simplifies the development and management of query jobs on clusters. As data is spatially partitioned, the tile name or UID forms the key for MapReduce, and identifying the spatial objects of a tile can be performed in the map phase. Depending on the query complexity, spatial queries can be implemented as map functions, reduce functions, or a combination of both. Based on the query type, different query pipelines are executed in MapReduce. As many spatial queries involve high complexity geometric computations, query parallelization through MapReduce can significantly reduce query response time.

2.3 Boundary Object Handling

In the past, two approaches were proposed to handle boundary objects in a parallel query processing scenario, namely Multiple Assignment and Multiple Matching [16]. In Multiple Assignment, the partitioning step replicates boundary crossing objects and assigns them to multiple tiles. In Multiple Matching, the partitioning step assigns a boundary crossing object to a single tile, but the object may appear in multiple tile pairs for spatial joins. While the Multiple Matching approach avoids storage overhead, a single tile may have to be read multiple times for query processing, which can increase both computation and I/O. The Multiple Assignment approach is simple to implement, requires no modification of spatial computation algorithms, and fits nicely into the MapReduce programming model. For example, a spatial join on tiles with Multiple Assignment based partitioning can be corrected by eliminating duplicate object pairs from the query result set, which can be implemented as an additional MapReduce job [8, 16].
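As an illustration, here is a minimal sketch of such a duplicate elimination job in Java. The record layout, with each join result keyed by the UID pair of the two joined objects, is an assumption for illustration, not the exact Hadoop-GIS format:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    /* Assumed input line: "uidA,uidB<TAB>joinResultPayload". The map
       step keys each result by its object-pair UIDs; the reduce step
       emits one record per pair, discarding the duplicates produced
       by boundary object replication. */
    public class DedupJob {
      public static class PairMapper
          extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = value.toString().split("\t", 2);
          ctx.write(new Text(parts[0]), new Text(parts[1]));
        }
      }
      public static class FirstOnlyReducer
          extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text pairKey, Iterable<Text> vals, Context ctx)
            throws IOException, InterruptedException {
          // All duplicates of a pair carry identical payloads; keep one.
          for (Text v : vals) { ctx.write(pairKey, v); break; }
        }
      }
    }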
3 Spatial Data Partitioning

Geospatial data tends to be heavily skewed. For example, if OpenStreetMap is partitioned into 1000 x 1000 fixed size tiles, the number of objects contained in the most skewed tile is nearly three orders of magnitude larger than that of an average tile. Such heavily skewed tiles can significantly increase the response time in a parallel computing environment due to straggling tasks. Thus, effective and efficient spatial data partitioning is essential for scalable spatial queries in MapReduce.

Spatial partitioning approaches generate boundary objects that cross multiple partitions, thus violating partition independence. Spatial query processing algorithms get around the boundary problem by using a replicate-and-filter approach [9, 16], in which boundary objects are replicated to multiple spatial partitions, and the side effects of such replication are remedied by filtering out the duplicates at the end of the query processing phase. This process adds extra query processing overhead proportional to the number of boundary objects. Therefore, a good spatial partitioning approach should minimize the number of boundary objects.

We developed SATO [12], an effective and scalable partitioning framework which produces balanced regions while minimizing the number of boundary objects. The partitioning methods are designed for scalability and can be easily parallelized for high performance. SATO stands for the four main steps of the partitioning pipeline: Sample, Analyze, Tear, and Optimize. First, a small fraction of the dataset is sampled to identify the overall global data distribution, including potential dense regions. Next, the sampled data is analyzed to produce a coarse partition scheme in which each partition region is expected to contain roughly equal numbers of spatial objects. Then these coarse partition regions are passed to the partitioning component, which tears the regions into more granular partitions satisfying the partition requirements. Finally, the generated partitions are analyzed to produce multilevel partition indexes and additional partition statistics which can be used for optimizing spatial queries. SATO integrates multiple partitioning algorithms that can handle diverse datasets, and each algorithm has its own merits [12]. SATO also provides MapReduce based implementations of the spatial partitioning methods through two alternative approaches: a top-down approach with region level parallelization, and a bottom-up approach with object level parallelization.
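To make the Sample and Tear steps concrete, the following is a minimal single-node sketch in Java: it takes sampled object MBB centroids and recursively splits an overfull region at the median along its longer axis. This illustrates only the general idea; the actual SATO algorithms [12] are considerably more sophisticated:

    import java.util.ArrayList;
    import java.util.List;

    /* Illustrative sample-and-tear partitioner. Points stand in for the
       sampled MBB centroids of spatial objects. */
    public class SampleTear {
      record Point(double x, double y) {}
      record Region(double minX, double minY, double maxX, double maxY) {}

      static void tear(List<Point> sample, Region r, int capacity,
                       List<Region> out) {
        if (sample.size() <= capacity) { out.add(r); return; }
        boolean splitX = (r.maxX() - r.minX()) >= (r.maxY() - r.minY());
        // Split at the median along the longer axis to balance counts.
        List<Point> pts = new ArrayList<>(sample);
        pts.sort((a, b) -> splitX ? Double.compare(a.x(), b.x())
                                  : Double.compare(a.y(), b.y()));
        Point med = pts.get(pts.size() / 2);
        double cut = splitX ? med.x() : med.y();
        List<Point> lo = new ArrayList<>(), hi = new ArrayList<>();
        for (Point p : pts)
          ((splitX ? p.x() : p.y()) < cut ? lo : hi).add(p);
        if (lo.isEmpty() || hi.isEmpty()) { out.add(r); return; } // degenerate
        if (splitX) {
          tear(lo, new Region(r.minX(), r.minY(), cut, r.maxY()), capacity, out);
          tear(hi, new Region(cut, r.minY(), r.maxX(), r.maxY()), capacity, out);
        } else {
          tear(lo, new Region(r.minX(), r.minY(), r.maxX(), cut), capacity, out);
          tear(hi, new Region(r.minX(), cut, r.maxX(), r.maxY()), capacity, out);
        }
      }
    }

Because the splits follow the sampled distribution rather than a fixed grid, dense regions receive proportionally more, smaller partitions, which is what counteracts the skew described above.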
4 MapReduce Based Spatial Query Processing

RESQUE provides the core query engine to support spatial queries, which enables us to develop a large scale spatial query processing framework based on MapReduce. Our approach is based on spatial data partitioning, tile based spatial query processing with MapReduce, and result normalization for tile boundary objects.

4.1 Spatial Join with MapReduce

Spatial join is among the most frequently used and costly queries in many spatial applications. Next, we discuss how to map spatial join queries into the MapReduce computing model. We first show an example spatial join query for spatial cross-matching in SQL, as shown in Figure 2. This query finds all intersecting polygon pairs between two sets of objects generated from an image by two different algorithms, and computes the overlap ratios (intersection-to-union ratios) and centroid distances of the pairs. In the table markup_polygon, the column polygon stores the object boundary as a polygon, and the column algorithm_uid stores the UID of the algorithm that produced it.

    SELECT ST_AREA(ST_INTERSECTION(ta.polygon, tb.polygon)) /
           ST_AREA(ST_UNION(ta.polygon, tb.polygon)) AS ratio,
           ST_DISTANCE(ST_CENTROID(tb.polygon),
                       ST_CENTROID(ta.polygon)) AS distance
    FROM   markup_polygon ta JOIN markup_polygon tb
           ON ST_INTERSECTS(ta.polygon, tb.polygon) = TRUE
    WHERE  ta.algorithm_uid = 'A1' AND tb.algorithm_uid = 'A2';

    Figure 2: An example spatial join (cross-matching) query

The SQL syntax comes with spatial extensions such as the spatial relationship operator ST_INTERSECTS, the spatial object operators ST_INTERSECTION and ST_UNION, and the spatial measurement functions ST_CENTROID, ST_DISTANCE, and ST_AREA. For simplicity, we first present how to process this spatial join with MapReduce while ignoring boundary objects, and return to boundary handling later. Input datasets are partitioned into tiles during the data loading phase, and each record is assigned a partition id. The spatial join query is implemented as a MapReduce query operator processed in the following three steps. i) Map step: the input datasets are scanned by the Map operator, and each mapper, after applying any user defined function or filter operation, emits the records with their partition id as the key, along with a tag indicating which dataset each record belongs to. ii) Shuffle step: records are sorted and shuffled to group the records having the same key (same partition id), and the intermediate results are materialized to local disks. iii) Reduce step: each reducer is assigned a single partition, and a spatial join algorithm, such as plane-sweep join or index based join, is used to process that partition. The join algorithm can be in-memory or disk based, depending on the size of the partition. In addition, during execution, Hadoop-GIS constructs an in-memory R*-Tree for each dataset in a partition, and uses these indexes to process the spatial join query.
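As an illustration of this map/reduce pair, here is a minimal sketch in Java. Hadoop-GIS itself drives its C++ RESQUE engine through Hadoop streaming (Section 6); this sketch instead uses the JTS topology suite as a stand-in for the engine, and the tab separated input layout (tile id, dataset tag, WKT polygon) is a hypothetical format for illustration:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.locationtech.jts.geom.Geometry;
    import org.locationtech.jts.index.strtree.STRtree;
    import org.locationtech.jts.io.ParseException;
    import org.locationtech.jts.io.WKTReader;

    public class TileSpatialJoin {
      /* Map: route each record to its tile; the value keeps the
         dataset tag and the WKT geometry. */
      public static class TileMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object k, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] f = line.toString().split("\t", 3); // tileId, tag, WKT
          ctx.write(new Text(f[0]), new Text(f[1] + "\t" + f[2]));
        }
      }
      /* Reduce: one tile per call; index dataset A with an STR-tree,
         probe with dataset B, verify exact intersection, emit metrics. */
      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text tile, Iterable<Text> vals, Context ctx)
            throws IOException, InterruptedException {
          WKTReader wkt = new WKTReader();
          STRtree index = new STRtree();
          List<Geometry> probes = new ArrayList<>();
          try {
            for (Text v : vals) {
              String[] f = v.toString().split("\t", 2);
              Geometry g = wkt.read(f[1]);
              if (f[0].equals("A")) index.insert(g.getEnvelopeInternal(), g);
              else probes.add(g);
            }
          } catch (ParseException e) { throw new IOException(e); }
          for (Geometry b : probes) {
            for (Object o : index.query(b.getEnvelopeInternal())) {
              Geometry a = (Geometry) o;
              if (a.intersects(b)) { // exact refinement after MBB filter
                double ratio = a.intersection(b).getArea()
                             / a.union(b).getArea();
                double dist = a.getCentroid().distance(b.getCentroid());
                ctx.write(tile, new Text(ratio + "\t" + dist));
              }
            }
          }
        }
      }
    }

The reducer mirrors the filter-and-refine pattern: the STR-tree query performs the MBB filter, and the intersects test performs the exact geometric refinement.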
4.2 Support of Other Query Types with MapReduce

Other types of spatial queries follow a similar processing pattern, as shown in Algorithm 1. Spatial selection or containment is a simple query type in which objects geometrically contained in a selection region are returned. For example, in a medical imaging scenario, users may be interested in the cell features contained in a cancerous tissue region. Since the data is organized in partitions, containment queries can be processed in a filter-and-refine fashion. In the filter step, partitions disjoint from the query region are excluded from further processing. In the refinement step, the candidate objects are checked with a precise geometry test. The global region index is used to generate a selective table scan operation which only scans the file splits potentially containing query results. The query is translated into a map only MapReduce program, as shown in [6]. Support of multi-way spatial join queries and nearest neighbor queries follows a similar pattern and is discussed in [5]. For k-nearest neighbor search, Hadoop-GIS provides two algorithms for the application scenario where the query is processed over a set of query objects and the cardinality of one set of objects is much smaller than the other. For example, a query in pathology imaging would, for each stem cell, find the nearest blood vessel, compute the variation of intensity of each biological property associated with the cell with respect to the distance, and return the density distribution of blood vessels around each cell. In this case the number of cells is significantly larger than the number of blood vessels. Both algorithms [5] use a replication strategy to parallelize nearest neighbor queries. Specifically, the dataset with the larger cardinality is partitioned and distributed over HDFS, and mappers replicate the smaller dataset to each node. Each reducer builds an in-memory index structure, such as a Voronoi diagram or an R-Tree, on the smaller dataset, and processes the query over the larger dataset utilizing the index.
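The reducer-side step of these algorithms can be sketched as follows. This is a minimal illustration using JTS, assuming an R-Tree variant (STRtree) rather than a Voronoi diagram; the nearestNeighbour call shown is JTS's API, not necessarily the structure Hadoop-GIS uses:

    import java.util.List;
    import org.locationtech.jts.geom.Geometry;
    import org.locationtech.jts.index.strtree.GeometryItemDistance;
    import org.locationtech.jts.index.strtree.STRtree;

    /* Illustrative nearest-neighbor step: index the small dataset
       (e.g., blood vessels) once, then probe with each object of the
       large dataset (e.g., cells). */
    public class NearestNeighborStep {
      private final STRtree index = new STRtree();

      public void indexSmallDataset(List<Geometry> vessels) {
        for (Geometry v : vessels) index.insert(v.getEnvelopeInternal(), v);
        index.build(); // freeze the tree before queries
      }

      public Geometry nearestTo(Geometry cell) {
        return (Geometry) index.nearestNeighbour(
            cell.getEnvelopeInternal(), cell, new GeometryItemDistance());
      }
    }

Because the small dataset is replicated to every node, each reducer can answer nearest neighbor probes locally, with no cross-partition communication.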
4.3 Boundary Handling

In partition based spatial query processing, some spatial objects lie on partition boundaries. As the partition size gets smaller, the percentage of boundary objects increases: in general, the fraction of boundary objects is inversely proportional to the partition size. Boundary objects pose a challenge in that they logically belong to multiple disjoint partitions and would generate duplicate results. Hadoop-GIS remedies the boundary problem in a simple but effective way. If a query requires a complete result, Hadoop-GIS generates a query plan which contains a pre-processing task and a post-processing task. In the pre-processing task, the boundary objects are duplicated and assigned to each intersecting partition (multiple assignment). When each partition is processed independently during query execution, the results are not yet correct due to the duplicates. In the post-processing task, the results from multiple partitions are normalized: objects go through a filtering step that eliminates duplicate records by checking object UIDs, which are internally assigned and globally unique. Intuitively, such an approach incurs extra query processing cost due to the replication and duplicate elimination steps. However, this additional cost is very small compared to the overall query processing time [6].

4.4 Performance

The RESQUE engine is highly efficient compared to a traditional spatial engine [6, 5]. In particular, the on-demand R*-Tree construction cost is less than one percent of the overall spatial join cost, and it incurs no index maintenance overhead, since the index is discarded once the query is processed. Geometric computation is the dominant cost in cross-matching spatial joins. While such computation is difficult to support in I/O optimization oriented parallel spatial database systems, Hadoop-GIS is well adapted for it and outperforms parallel spatial DBMSs [6]. In particular, Hadoop-GIS achieves high scalability, as the on-demand spatial query engine can be easily executed in multiple parallel MapReduce tasks on cluster nodes.

5 GPU Supported Spatial Queries

GPUs employ a SIMD architecture that executes the same instruction logic on a large number of cores simultaneously. Many spatial algorithms and geometric computations do not naturally fit this parallelization model. Two alternative approaches have been proposed for GPU based geometric operations, in particular polygon intersection.

Monte-Carlo Based Method. This approach uses a Monte-Carlo method for rasterization, which transforms the combined spatial space of two polygons into a pixel based representation. After such a transformation, the original vector geometric computation can be performed on the pixel based representation: the intersection area is determined by counting the pixels belonging to both polygons. A common approach to check whether a pixel is within a polygon is to use ray tracing [11] for the point-in-polygon test. As the operation for each pixel is fully independent of the others, the tests can be effectively executed in parallel by GPU threads [15]. The rasterization resolution is critical for achieving the best performance: a high resolution yields a larger number of pixels and consequently increases the compute intensity of the geometric computation, while a low resolution improves computational efficiency but leads to loss of accuracy.

PixelBox. A more adaptive approach [15], named PixelBox, reduces the compute intensity while ensuring computation accuracy. Specifically, PixelBox first partitions the space into cells or boxes. For boxes containing edges of polygons, rasterization is performed as in the Monte-Carlo approach. In this way, a group of pixels in a box can be tested together for the containment relationship with a polygon, and pixel level testing is performed only for edge crossing areas, which much improves computational efficiency. The experiments demonstrate a two orders of magnitude performance improvement for the intersection operation compared to a single threaded CPU algorithm.
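To illustrate the per-pixel logic that the GPU parallelizes across threads, here is a minimal sequential sketch in Java of the even-odd ray crossing test and the pixel counting estimate of intersection area. This is the textbook algorithm, not the tuned kernel of [15]:

    /* Even-odd (ray crossing) point-in-polygon test and a pixel-counting
       estimate of the intersection area of two polygons over a shared
       bounding box. On a GPU, each (px, py) pixel maps to one thread. */
    public class PixelIntersection {
      // Polygon given as parallel arrays of vertex coordinates.
      static boolean contains(double[] xs, double[] ys, double px, double py) {
        boolean inside = false;
        for (int i = 0, j = xs.length - 1; i < xs.length; j = i++) {
          // Count edges crossed by a horizontal ray from (px, py) to +infinity.
          if ((ys[i] > py) != (ys[j] > py)
              && px < (xs[j] - xs[i]) * (py - ys[i]) / (ys[j] - ys[i]) + xs[i]) {
            inside = !inside;
          }
        }
        return inside;
      }

      static double intersectionArea(double[] ax, double[] ay,
                                     double[] bx, double[] by,
                                     double minX, double minY,
                                     double maxX, double maxY, int res) {
        double dx = (maxX - minX) / res, dy = (maxY - minY) / res;
        long hits = 0;
        for (int i = 0; i < res; i++) {        // each pixel is independent:
          for (int j = 0; j < res; j++) {      // GPU threads in [15]
            double px = minX + (i + 0.5) * dx, py = minY + (j + 0.5) * dy;
            if (contains(ax, ay, px, py) && contains(bx, by, px, py)) hits++;
          }
        }
        return hits * dx * dy; // pixel count scaled by pixel area
      }
    }

At a higher rasterization resolution res, the pixel loop dominates; this is exactly the compute intensity that GPU parallelization targets and that PixelBox reduces by testing whole boxes at once.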
Integration of GPU Based Geometric Computation into MapReduce. To support more efficient execution on accelerated systems, we have been extending Hadoop-GIS to execute spatial query operations with GPUs on distributed memory machines. One goal is to design an efficient bridge interface between the MapReduce program and the GPU program: sending many small tasks to the GPU may incur substantial communication overhead and subdue the benefit of the GPU. We propose a prediction model that decides the granularity of tasks for GPU invocation by considering both the data communication cost and the execution cost of different types of spatial operations. Another goal is to achieve load balancing and data/operation aware task assignment in the CPU/GPU hybrid environment. We first take a knowledge based approach to decide assignments to the CPU or the GPU, and then perform efficient task migration between the CPU and the GPU in case of an unbalanced task assignment. Preliminary work is reported in [4].
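As a purely illustrative sketch of such a granularity decision (the linear cost form and its coefficients are assumptions for illustration, not the model of [4]):

    /* Hypothetical dispatch rule: batch geometry tasks and send the
       batch to the GPU only when the predicted GPU time (transfer plus
       kernel execution) beats the predicted CPU time. */
    public class GpuDispatch {
      // Illustrative per-operation cost coefficients, calibrated offline.
      private final double cpuNsPerVertex;     // CPU time per polygon vertex
      private final double gpuNsPerVertex;     // GPU kernel time per vertex
      private final double transferNsPerByte;  // host-to-device copy cost
      private final double gpuLaunchNs;        // fixed kernel launch overhead

      public GpuDispatch(double cpu, double gpu, double xfer, double launch) {
        cpuNsPerVertex = cpu; gpuNsPerVertex = gpu;
        transferNsPerByte = xfer; gpuLaunchNs = launch;
      }

      public boolean sendBatchToGpu(long totalVertices, long batchBytes) {
        double cpuCost = cpuNsPerVertex * totalVertices;
        double gpuCost = gpuLaunchNs + transferNsPerByte * batchBytes
                       + gpuNsPerVertex * totalVertices;
        return gpuCost < cpuCost; // larger batches amortize the fixed costs
      }
    }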
6 Software

The high adaptability of the framework allows the system to be deployed on computer clusters or in cloud computing environments such as Amazon EC2. Hadoop-GIS is available as a set of library functions, including input data transformation, data partitioning, spatial indexing, spatial query processing, and MapReduce based execution. It also includes pipelines for combining multiple query jobs. We implemented the core spatial indexing and querying methods in C++, and the MapReduce programs in Java. We use the Hadoop streaming mechanism to bridge the communication between the C++ libraries and the Java based MapReduce programs. The pipelines are a set of scripts which can be easily customized. Hadoop-GIS is open source, and it can be downloaded from the project web site [2].

Acknowledgments

This work is supported in part by NSF IIS, by NSF ACI, by NLM R01LM009239, and by NCI 1U24CA A1.

References

[1] Open Street Map. http://www.openstreetmap.org.
[2] Hadoop-GIS project web site.
[3] A. Aji, X. Sun, H. Vo, Q. Liu, R. Lee, X. Zhang, J. Saltz, and F. Wang. Demonstration of Hadoop-GIS: A spatial data warehousing system over MapReduce. In SIGSPATIAL/GIS. ACM, 2013.
[4] A. Aji, G. Teodoro, and F. Wang. Haggis: Turbocharge a MapReduce based spatial data warehousing system with GPU engine. In BigSpatial '14, pages 15-20, 2014.
[5] A. Aji, F. Wang, and J. H. Saltz. Towards building a high performance spatial query system for large scale medical imaging data. In SIGSPATIAL/GIS. ACM, 2012.
[6] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop-GIS: A high performance spatial data warehousing system over MapReduce. Proc. VLDB Endow., 6(11), Aug. 2013.
[7] T. Brinkhoff, H.-P. Kriegel, and B. Seeger. Parallel processing of spatial joins using R-trees. In ICDE, 1996.
[8] M.-L. Lo and C. V. Ravishankar. Spatial hash-joins. In SIGMOD, 1996.
[9] J. Patel et al. Building a scalable geo-spatial DBMS: technology, implementation, and evaluation. In SIGMOD, 1997.
[10] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD, 2009.
[11] T. J. Purcell, I. Buck, W. R. Mark, and P. Hanrahan. Ray tracing on programmable graphics hardware. ACM Transactions on Graphics, 21(3), 2002.
[12] H. Vo, A. Aji, and F. Wang. SATO: A spatial data partitioning framework for scalable query processing. In SIGSPATIAL/GIS. ACM, 2014.
[13] F. Wang, J. Kong, L. Cooper, T. Pan, K. Tahsin, W. Chen, A. Sharma, C. Niedermayr, T. W. Oh, D. Brat, A. B. Farris, D. Foran, and J. Saltz. A data model and database for high-resolution pathology analytical image informatics. J Pathol Inform, 2(1):32, 2011.
[14] F. Wang, J. Kong, J. Gao, D. Adler, L. Cooper, C. Vergara-Niedermayr, Z. Zhou, B. Katigbak, T. Kurc, D. Brat, and J. Saltz. A high-performance spatial database based approach for pathology imaging algorithm evaluation. J Pathol Inform, 4(5), 2013.
[15] K. Wang, Y. Huai, R. Lee, F. Wang, X. Zhang, and J. H. Saltz. Accelerating pathology image data cross-comparison on CPU-GPU hybrid systems. Proc. VLDB Endow., 5(11), 2012.
[16] X. Zhou, D. J. Abel, and D. Truffet. Data partitioning for parallel spatial join processing. GeoInformatica, 2(2), 1998.