High Performance Analysis of Big Spatial Data
2015 IEEE International Conference on Big Data (Big Data)

David Haynes 1, Suprio Ray 2, Steven M. Manson 3, Ankit Soni 1
1 Minnesota Population Center, University of Minnesota, {dahaynes, asoni}@umn.edu
2 Faculty of Computer Science, University of New Brunswick, Fredericton, [email protected]
3 Department of Geography, University of Minnesota, [email protected]

Abstract

Every year research institutions produce petabytes of data, yet only a small percentage of these data are readily accessible for analysis. Terra Populus acts as the bridge between big data sources and researchers. Researchers are provided convenient web applications that allow them to access, analyze, and tabulate different datasets under a common platform. Terra Populus is developing three unique applications. The first, Paragon, is a prototype parallel spatial database that aims to extend the functionality of PostgreSQL and PostGIS onto multi-node systems. The second, the Terra Populus Tabulator, employs Parquet on Spark to build dynamic queries for analyzing large population survey data. The third, Terra Explorer, is an exploratory analysis tool for visualizing the spatial datasets within the repository.

Index Terms: big spatial data, spatial analysis, Spark, Parquet, web mapping, dynamic visualization

I. INTRODUCTION

The Terra Populus project demonstrates emerging capabilities in integrating complex big data, namely heterogeneous spatial data, within high performance computation environments. Terra Populus identifies, acquires, and develops various data sources ranging from historic handwritten census forms to current satellite observations of the earth. The aim of the project is to preserve these data and make them Internet-accessible, so they are readily available for scholars, students, policy makers, and members of the public. Terra Populus is a member of the National Science Foundation DataNet Initiative, the goal of which is to develop cyberinfrastructure that preserves, integrates, and provides open access to scientific data.

Terra Populus utilizes an open source ecosystem to provide access to heterogeneous big data (microdata, raster data, and vector data) for researchers studying a range of social and natural systems along with human-environment interactions. A primary challenge for this project has been developing methods that preserve the integrity of the underlying data, are sufficiently flexible to support the needs of many different users, and scale for high performance computation. The tools and platforms that we are developing give users the ability to create tailored queries that can be analyzed quickly, which in turn fosters the growth of scientific knowledge.

The Terra Populus data collection exemplifies two of the primary Vs commonly attributed to big data: variety and volume. We deal with data from a wide array of sources (variety); with some of the largest databases in the world (volume); and with remotely sensed imagery, among other sources, which is collected in large and rapidly increasing amounts. Terra Populus also deals with big data characterized by two newer Vs, namely value (integrating our data products provides greater scientific understanding) and veracity (our data collection is composed of gold-standard data vetted for research). The data processing toolset that we are developing seeks to accommodate all of these big data characteristics.
There are a growing number of big data processing and analytics toolsets, yet there is a paucity of tools (or even basic research) that work with heterogeneous big spatial data or provide interoperability between datasets. This paper describes three computational challenges that exemplify big spatial data and are being tackled by the Terra Populus project. We examine in particular our approaches for high performance spatial analysis, dynamic tabulation, and dynamic visualization.

II. HIGH PERFORMANCE SPATIAL ANALYSIS: PARAGON

Spatial databases are the driving force behind most traditional GIS applications, in that domains ranging from land surveys to city planning and resource monitoring depend on the existence of a spatial database on which all other operations rely. While big data applications are remaking the form of spatial databases, particularly due to the rapid rise in data volume, there will always remain a need for spatial analysis. As a result, we are seeing new classes of spatial applications, and Terra Populus is developing new methods that operate on spatial data.

Regardless of data source, spatial queries lie at the heart of spatial analysis. Spatial queries can take many forms, including range queries (e.g., return all points of land higher than 4,000 m inside a national park), k-nearest-neighbor queries, and join queries (e.g., all rivers that have bridges that cross them), among others. Whereas geospatial web services such as Google Maps are driven by short-running range queries, many of the emerging spatial analytics applications are characterized by long-running spatial join queries [6]. For example, a polyline-crosses-polyline spatial join query involving 73 million records takes over 20 hours [4]. These queries can benefit from high performance query processing.
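To make these query types concrete, the following is a minimal sketch of a range query and a spatial join expressed as PostGIS SQL and issued from Python with psycopg2. The table and column names (elevation_points, parks, rivers, bridges) and the connection string are hypothetical placeholders for illustration, not part of the Terra Populus schema.

import psycopg2

# Hypothetical connection string and table names; shown only to illustrate the
# query types described above.
conn = psycopg2.connect("dbname=terrapop")
cur = conn.cursor()

# Range query: points of land higher than 4,000 m that fall inside a national park.
cur.execute("""
    SELECT p.gid
    FROM elevation_points p
    JOIN parks k ON ST_Contains(k.geom, p.geom)
    WHERE p.elevation > 4000;
""")
high_points = cur.fetchall()

# Spatial join: all rivers that have bridges crossing them.
cur.execute("""
    SELECT DISTINCT r.name
    FROM rivers r
    JOIN bridges b ON ST_Crosses(b.geom, r.geom);
""")
crossed_rivers = cur.fetchall()

cur.close()
conn.close()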
With the growing popularity of the open source MapReduce framework Hadoop, a number of MapReduce-based spatial query processing systems have been developed. These include academic projects like SpatialHadoop [2] and Hadoop-GIS [9], along with commercial projects such as GeoTrellis [5]. However, the set of spatial query features supported by these solutions is still limited, especially compared to standard GIS products that operate on relational tables. This is why Terra Populus is currently exploring other options, including the development of our own system.

Among MapReduce-based non-spatial systems, there is a trend towards implementing relational database features. This move toward relational-like features has led to the observation [7] that these SQL-on-MapReduce systems have come full circle, in that they resemble shared-nothing parallel relational databases. Pavlo et al. [10] argue that parallel relational databases take advantage of over 20 years of research, which MapReduce systems cannot match. With extensive evaluation, they demonstrated that shared-nothing parallel relational databases perform significantly better than MapReduce systems. Therefore, we have adopted the view that a shared-nothing parallel database can offer the best high performance platform for spatial query processing. Since there is no existing open-source shared-nothing parallel spatial database, we are developing one called Paragon.

Paragon is a parallel spatial database that runs a PostgreSQL database instance on each of the nodes in a cluster of machines. Spatial support was added to PostgreSQL in 2005 as part of the PostGIS extension. While the initial release only included support for vector datasets, subsequent versions have added support for raster datasets. As PostGIS has evolved over the years, so have parallel environments for PostgreSQL. When first developed, PostgreSQL operated only on a single core; even today, when placed on servers with multiple cores, it can utilize only a single core per query. Multiple projects have made an effort to scale PostgreSQL. GridSQL was one of the first notable projects to make such an effort, followed by Stado [12]. Stado supports vector data types but lacks full functionality for distributing spatial data across a cluster of machines.

There are a number of research issues that are pertinent to Paragon in the context of heterogeneous big spatial data management. In a shared-nothing parallel database, the dataset is horizontally partitioned and assigned to each node in the cluster. With spatial data, this partitioning process is known as spatial declustering. This process involves logically decomposing the spatial domain into 2-dimensional partitions, called tiles. The spatial domain can be partitioned into regular grids or using another method, such as recursive decomposition based on a constraint on the object count in each tile. Since a spatial object's extent may overlap more than one of these tiles, it may be necessary to replicate such an object to every partition that it overlaps. Due to these considerations, there are several design choices, and as a consequence several spatial declustering approaches have been proposed [18, 19, 4, 11].
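As a concrete illustration of grid-based spatial declustering with replication, the sketch below assigns each object to every tile of a regular grid that its bounding box overlaps. It is a minimal, single-machine sketch for illustration only, not Paragon's declustering code; the object and domain representations are assumptions.

from collections import defaultdict

def decluster(objects, domain, nx, ny):
    """Assign each object to every tile of a regular nx-by-ny grid that its
    bounding box overlaps (objects spanning tile boundaries are replicated).

    objects: iterable of (object_id, (minx, miny, maxx, maxy))
    domain:  (minx, miny, maxx, maxy) of the full spatial extent
    Returns a dict mapping tile_id -> list of object_ids.
    """
    dminx, dminy, dmaxx, dmaxy = domain
    tile_w = (dmaxx - dminx) / nx
    tile_h = (dmaxy - dminy) / ny
    tiles = defaultdict(list)
    for oid, (minx, miny, maxx, maxy) in objects:
        i0 = max(0, int((minx - dminx) // tile_w))
        i1 = min(nx - 1, int((maxx - dminx) // tile_w))
        j0 = max(0, int((miny - dminy) // tile_h))
        j1 = min(ny - 1, int((maxy - dminy) // tile_h))
        for j in range(j0, j1 + 1):
            for i in range(i0, i1 + 1):
                tiles[j * nx + i].append(oid)  # replicate across overlapped tiles
    return tiles

# Example: a 32 x 32 grid over a unit-square domain yields 1024 tiles.
parts = decluster([("a", (0.10, 0.10, 0.40, 0.12))], (0.0, 0.0, 1.0, 1.0), 32, 32)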
The creation of tiles through spatial declustering enables parallel joins, as each tile can be processed independently. The overall performance of a spatial join is determined by the execution time of the slowest tile. The spatial declustering approach must take into account object distribution skew, processing skew, and the fact that the refinement step dominates the two-step spatial query evaluation [11]. Furthermore, within the Terra Populus project, a key requirement is to be able to join vector and raster datasets. Therefore, the spatial declustering approach must address the fact that vector datasets are discrete single-layer objects, whereas rasters can comprise multiple layers with varying spatial resolutions. Paragon supports both vector and raster data types. We are developing spatial declustering algorithms that address the above-mentioned issues.

Additionally, the query scheduler needs to coordinate query tasks intelligently among computing nodes. Paragon executes a spatial join query in an iterative fashion, with each node processing one partition at a time. This is inspired by our previous work [4, 11]. However, a limitation of the previous approach is that each node needed to host the entire set of partitions. We are exploring spatial partitioning approaches such that each node can host a subset of the partitions. Finally, to achieve good performance, the query optimizer needs to generate ideal query plans [4] while executing spatial join queries, essentially taking spatial locality into account.

Although Stado supports vector-based spatial queries, it does not support spatial-declustering-based data distribution and does not generate ideal query plans for spatial queries. As a result, the parallel performance of a spatial join query with Stado can actually be worse than that of sequential execution on a PostgreSQL instance. For example, we demonstrated in [20] that the execution time of the query Polyline Intersects Polygon on a 2-node Stado system is significantly longer than on a single-instance PostgreSQL.

To address the above-mentioned issues, we have developed Paragon as an extension to Stado [12]. Paragon exploits a cluster of nodes, each of which hosts a PostgreSQL/PostGIS database instance. We present a preliminary evaluation of Paragon on a dataset consisting of line and polygon objects from the TIGER [6] California dataset; the details of the dataset are shown in Table 1. The current implementation of Paragon uses a variant of round-robin declustering. The declustering algorithm produced 1024 spatial partitions after processing the dataset in Table 1. The physical storage and management of the partitions in Paragon takes advantage of PostgreSQL's table partitioning feature [16]. We extended the SQL CREATE TABLE statement to specify spatial declustering parameters, such as the number of partitions to be created, the declustering method, and a label for the declustering scheme. To execute a spatial join, the labels of the two tables being joined must match. This mechanism allows the same spatial dataset to be partitioned using different declustering schemes.
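The paper does not reproduce the exact syntax of the extended CREATE TABLE statement, so the sketch below is purely illustrative: the PARTITION BY SPATIAL clause and its parameter names are hypothetical stand-ins for the declustering parameters described above (partition count, declustering method, and scheme label), issued here through psycopg2. Stock PostgreSQL or Stado would not accept this clause as written.

import psycopg2

conn = psycopg2.connect("dbname=paragon")  # hypothetical connection string
cur = conn.cursor()

# Hypothetical DDL: the clause and parameter names below are invented
# placeholders mirroring the parameters described in the text, not Paragon's
# actual syntax. Both tables share the scheme label 'ca_1024', so a spatial
# join between them can be evaluated partition by partition.
cur.execute("""
    CREATE TABLE edges (gid integer, geom geometry)
    PARTITION BY SPATIAL (partitions = 1024, method = 'round_robin', scheme = 'ca_1024');
""")
cur.execute("""
    CREATE TABLE arealm (gid integer, geom geometry)
    PARTITION BY SPATIAL (partitions = 1024, method = 'round_robin', scheme = 'ca_1024');
""")
conn.commit()
cur.close()
conn.close()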
Table 1. Spatial data used for comparison

  Database Table (acronym)    Geometry    Number of Objects
  Area-water (Aw)             Polygon     39,334
  Area-landmark (Al)          Polygon     5,5951
  Edge (Ed)                   Polyline    4,173,498

Table 2. Comparison of query times: Paragon vs. PostgreSQL

  Query (acronym)                           PostgreSQL (seconds)    Paragon (seconds)    Speedup
  Polygon Overlaps Polygon (Aw_ov_Aw)       …                       …                    …
  Polyline Touches Polygon (Ed_to_Al)       …                       …                    …
  Polyline Crosses Polyline (Ed_cr_Ed)      …                       …                    …

We executed spatial join queries from the Jackpine spatial database benchmark [6] with Paragon on a two-node cluster. The queries are expressed in SQL with some of the spatial predicates adopted by the Open Geospatial Consortium (OGC). For instance, Code 1 shows the Polyline Touches Polygon query of Table 2.

Code 1. Spatial SQL query

SELECT COUNT(*)
FROM edges ed, arealm al
WHERE ST_Touches(ed.geom, al.geom);

A comparison of observed single-run execution times is shown in Table 2. The results show that Paragon achieves near-linear speedup with two nodes, which bodes well for moving the system to more nodes. We plan to evaluate Paragon with a larger cluster in the future. Although optimized for spatial joins, Paragon is a parallel relational database, so it is also well suited for complex joins involving spatial and non-spatial tables. Our hope is that Paragon will become a useful tool for big geospatial data processing for the research community as well as enterprises.

III. DYNAMIC TABULATION

Microdata is the term used to denote survey data on individuals and households collected by government census agencies. It is extremely rich data due to its flexibility for tabulation. Since a microdata record contains all the responses of a single individual, researchers are able to create new tabulations from the original data. Terra Populus is currently investigating high performance computation platforms that will allow users to build their own tabulations from microdata datasets and their variables.

A Terra Populus user can tabulate the data by selecting the datasets, the row and column variables, and any constraints or filters to be applied to those variables. Usually a geographic variable (e.g., county code) forms a row variable so that users can analyze data geographically; however, that is not necessary. For example, a researcher may be interested in studying the effect of educational attainment on marital status and the conception of a first child. The proposed system can generate this targeted tabulation for the researcher through a tabulation definition that selects females who have had at least one child and whose age is between 16 and 45, tabulated against all levels of educational attainment. Such tabulations are common in social science research and support specific research questions. In this example, the system would construct a Spark operation similar to Code 2.

Code 2. Pseudocode in Python

data_frame.filter("SEX = 'F' AND NUMBER_OF_CHILDREN >= 1 AND AGE BETWEEN 16 AND 45") \
          .groupBy(["dataset", "AGE", "EDUCATION_ATTAINMENT", "MARITAL_STATUS", "ELDEST_CHILD"]) \
          .count()

In the initial design, we developed a tabulator that had pre-defined rules on each variable/column of each dataset. In this case, users were only allowed to select data defined by those pre-defined rules. For example, the variable AGE was divided into subgroups of 0 to 4, 5 to 9, and so on. The data were physically limited to these predefined aggregations. Therefore, when using the initial tabulator, aggregations could only take place on the defined variable/column rules, such as age_5_9.
However, given the size and granularity of different datasets, we proposed designing a new tabulator that removes all pre-existing aggregations, placing all of the age data under a single variable/column, AGE. The result is that end users have more functionality at their disposal and can create custom queries appropriate for their study, such as tabulating all persons of age six (e.g., AGE = 6). This increases the querying flexibility of our system and provides the ability to define custom queries such as the one described in Code 2.

During our initial tests, we also found that relational databases do not scale well when working with datasets whose table structure is in a de-normalized form with a large number of columns. This is due to the nature of traditional row-oriented databases, which carry most columns of each row while performing operations on the selected data. This makes on-the-fly operations difficult to perform when the number of columns becomes large. In addition, we have evidence that aggregations and large range queries also perform poorly on these de-normalized tables. This led us towards column-store databases.

The proposed tabulator employs columnar storage because storage, retrieval, selection, and filter operations can be performed efficiently. As our data extraction system utilizes PostgreSQL and PostGIS for spatial data interoperability, we investigated using a columnar storage extension for PostgreSQL, cstore_fdw, which is a foreign data wrapper for PostgreSQL. However, we have moved away from using cstore_fdw for tabulation, as it has an upper limit of 1,600 columns, whereas some datasets, or unions of datasets, contain more columns than that.

Currently we are testing Apache Parquet, accessed through Spark, as the columnar storage format for our next-generation tabulator [13]. Parquet provides columnar storage in which columns are type-cast to Parquet's built-in data types, allowing for greater compression per data type. This approach gives a high compression ratio while still allowing for fast data fetching, and is based on an existing record shredding and assembly algorithm [14].
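As a concrete illustration of this workflow, the following is a minimal PySpark sketch, assuming hypothetical file paths and a hypothetical CSV extract; it is not the production tabulator. The one-time conversion writes the flat file out as Parquet, after which tabulations read only the columns they touch, through either the DataFrame API (Codes 2 and 3) or Spark SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("terrapop-tabulator").getOrCreate()

# One-time conversion: read a flat (CSV) microdata extract and write it out
# as Parquet, which stores each column separately with type-aware compression.
flat = spark.read.csv("/data/microdata/brazil_2010.csv", header=True, inferSchema=True)
flat.write.mode("overwrite").parquet("/data/parquet/brazil_2010")

# Subsequent tabulations read only the columns they touch.
people = spark.read.parquet("/data/parquet/brazil_2010")
people.createOrReplaceTempView("all_people")

# The same kind of tabulation can be expressed through Spark SQL instead of
# the DataFrame API used in Code 2 and Code 3.
result = spark.sql("""
    SELECT dataset, SEX, COUNT(*) AS n
    FROM all_people
    WHERE SEX = 1
    GROUP BY dataset, SEX
""")
result.show()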
Although converting regular flat files to Parquet format is a time-consuming process, the cost is offset by better query performance on large amounts of data. Since our use case has rare updates, a single Parquet conversion results in better query performance over time. Once converted to Parquet format, the datasets can be queried via Spark SQL [15].

We tested the performance of Parquet using IPUMS-I [17] datasets, which are the primary microdata source of Terra Populus [1]. These tests were performed on a single machine with eight processors and six gigabytes of memory. Table 3 shows results for multiple queries. Query 1 returns a summary count of records, where the records are filtered by a column value and grouped by dataset (e.g., the count of males in Argentina 1970).

Code 3. Spark SQL Query 1 in Python

all_people.filter("SEX = 1").groupBy(["SEX", "dataset"]).count()

Query 2 employs filters on two columns and performs three aggregations (e.g., the count of males in a given age range in Argentina 1970). Query 3 employs four filters on three columns and performs aggregations on five column variables. The dataset chosen for the single-dataset test, row 1, was Brazil 2010, which has 309 columns. The datasets chosen for row 4 consist of multiple country datasets in South America, with 4,326 unique columns across 56 different datasets. The datasets chosen for rows 2 and 3 are subsets of the data found in row 4.

Table 3. Parquet query performance on multiple datasets

  Number of datasets    Number of records    Time for Query 1    Time for Query 2    Time for Query 3
  1                     … million            5 seconds           9 seconds           10 seconds
  …                     … million            7 seconds           10 seconds          18 seconds
  …                     … million            10 seconds          12 seconds          27 seconds
  56                    … million            7.5 seconds         20 seconds          30 seconds
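Queries 2 and 3 are described above only by their shape. The snippet below is a hedged reconstruction of a Query-2-shaped tabulation (filters on two columns, grouping on three), written in the style of Code 3 against the same all_people DataFrame; the column names and the age range are assumptions, not the actual benchmark definitions.

# Hedged reconstruction of a Query-2-shaped tabulation; the SEX/AGE columns
# and the 20-29 age range are assumptions, not the benchmark's definition.
query2 = (all_people
          .filter("SEX = 1 AND AGE BETWEEN 20 AND 29")
          .groupBy(["dataset", "SEX", "AGE"])
          .count())
query2.show()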
IV. DYNAMIC VISUALIZATION

The integration and development of our data repository will grant researchers access to new datasets and formats. While many of our existing users are used to, and prefer, text-based interfaces, we expect some portion of our user community to want visual/spatial interfaces as well. We are therefore developing a web mapping application that acts as a data exploration tool. Terra Populus is rare among human-environment data portals in that users have the ability to tailor their extracts to the datasets they need. This presents opportunities and challenges because users may identify or exclude datasets from their extracts. The visualization application that we are developing helps users understand the availability of data within the study area of their choice.

The application has three main components: a web mapping pane, a data selection pane, and a data visualization pane (Fig. 1). We use GeoServer, a Java server application that provides spatial data delivery for web applications [8]. The application uses the JavaScript library OpenLayers to consume and display these spatial datasets on the mapping interface. Through this application Terra Populus integrates spatial and non-spatial datasets.

Figure 1. Terra Explorer landing page

The primary purpose of the application is to support data exploration, and our development efforts have therefore focused on implementing a dynamic and interactive framework that reflects the other applications within our ecosystem. Currently, the application allows users to interactively visualize vector and raster datasets. Areal datasets are dynamically created and styled using choropleth mapping techniques and GeoServer's REST SLD API. Variable substitution is applied to rasters to achieve a similar effect. In this approach, each value of a categorical raster is treated as an independent category. For example, IGBP MODIS satellite imagery has pixel values ranging from 0 to 255, with pixel values 0-16 representing land cover types (e.g., 0 = water, 12 = croplands). The application allows the user to adjust the opacity of any raster pixel value, which enhances the user's understanding of the presence of particular components of a dataset within their study region.

The web mapping interface allows for data exploration and facilitates an understanding of what data are present. However, web mapping is not the only visualization tool in our application. Data visualizations (i.e., bar charts and plots) combined with mapping can enhance the user's understanding of data. Data summarization of our areal (population demographic) and tabulated datasets is relatively straightforward and can be implemented using the previously described tabulator. However, real-time summarization of raster datasets is a computationally intensive problem that remains unsolved. To implement this capability we have worked with the SpatialHadoop team to develop a prototype spatial overlay engine [2]. This new prototype supports zonal statistics calculations on rasters in their native GeoTIFF format using Spark. The prototype engine calculates zonal statistics for any spatial region and returns summary statistics (i.e., min, max, mean, sum, count) via an API. To perform this analysis, a spatial overlay between vector and raster datasets is implemented on the SpatialHadoop platform. The engine must be able to handle different spatial projections, as the project utilizes datasets derived from different satellite instruments. To reduce data errors caused by projections, we project the vector datasets to the raster coordinate system and then perform the zonal analysis.

The first step is to tile and compress the raster, which reduces I/O costs and memory challenges. While doing this, we create a metadata file that provides the spatial bounding box of each compressed raster tile. Next, a scan-line fill algorithm is used to determine which pixels lie within each geographic unit. The third step aggregates the pixels and their values by zone (geographic unit) and returns the result. The resulting data can then be visualized in the data visualization pane.
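For illustration, the following is a simplified, single-machine sketch of the scan-line fill and per-zone aggregation described above, assuming the raster tile is held in a NumPy array and the zone polygon is given in pixel coordinates within that tile. It is a minimal sketch of the idea, not the SpatialHadoop/Spark prototype itself.

import numpy as np

def zone_mask(polygon, shape):
    """Scan-line fill: mark cells whose centers fall inside `polygon`, a list
    of (col, row) vertices in pixel coordinates (assumed within the raster)."""
    rows, _ = shape
    mask = np.zeros(shape, dtype=bool)
    n = len(polygon)
    for r in range(rows):
        y = r + 0.5                                  # scan line through pixel centers
        xs = []
        for i in range(n):                           # x-coordinates where edges cross the scan line
            (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
            if (y1 <= y < y2) or (y2 <= y < y1):
                xs.append(x1 + (y - y1) * (x2 - x1) / (y2 - y1))
        xs.sort()
        for x_in, x_out in zip(xs[0::2], xs[1::2]):  # fill between pairs of crossings
            mask[r, int(np.ceil(x_in - 0.5)):int(np.floor(x_out - 0.5)) + 1] = True
    return mask

def zonal_stats(raster, polygon):
    """Aggregate the raster pixel values that fall inside the zone polygon."""
    values = raster[zone_mask(polygon, raster.shape)]
    return {"min": values.min(), "max": values.max(), "mean": values.mean(),
            "sum": values.sum(), "count": values.size}

# Example: a synthetic 100 x 100 tile and a rectangular zone in pixel coordinates.
tile = np.random.randint(0, 255, size=(100, 100))
print(zonal_stats(tile, [(10, 10), (60, 10), (60, 40), (10, 40)]))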
V. CONCLUSION

Terra Populus is advancing our ability to integrate complex big data in a high performance computation environment. Terra Populus has chosen to employ open source software to manage a large range of data in a mix of formats (microdata, raster data, and vector data) for use by researchers studying a range of social, biophysical, and human-environment systems. While there are a growing number of open-source candidates to meet the challenges faced by Terra Populus, there is a profound need for a general framework that operates on heterogeneous big data. We focus on solutions for high performance spatial analysis, dynamic tabulation, and dynamic visualization.

REFERENCES

[1] Terra Populus.
[2] A. Eldawy and M. F. Mokbel. SpatialHadoop: A MapReduce Framework for Spatial Data. ICDE.
[3] S. Wang. A CyberGIS Framework for the Synthesis of Cyberinfrastructure, GIS, and Spatial Analysis. AAAG.
[4] S. Ray, B. Simion, A. D. Brown, and R. Johnson. A Parallel Spatial Data Analysis Infrastructure for the Cloud. SIGSPATIAL GIS.
[5] GeoTrellis.
[6] S. Ray, B. Simion, and A. D. Brown. Jackpine: A Benchmark to Evaluate Spatial Database Performance. ICDE.
[7] A. Floratou, U. M. Minhas, and F. Özcan. SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures. VLDB.
[8] GeoServer.
[9] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop-GIS: A High Performance Spatial Data Warehouse System Over MapReduce. VLDB.
[10] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.
[11] S. Ray, B. Simion, A. D. Brown, and R. Johnson. Skew-Resistant Parallel In-memory Spatial Join. SSDBM.
[12] Stado.
[13] Parquet.
[14] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive Analysis of Web-Scale Datasets. VLDB.
[15] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational Data Processing in Spark. SIGMOD.
[16] PostgreSQL Partitioning. docs/8.3/static/ddl-partitioning.html
[17] IPUMS-I.
[18] J. M. Patel and D. J. DeWitt. Partition Based Spatial-Merge Join. SIGMOD.
[19] J. M. Patel and D. J. DeWitt. Clone Join and Shadow Join: Two Parallel Spatial Join Algorithms. SIGSPATIAL GIS.
[20] D. Haynes, S. Ray, S. Manson, D. Van Riper, A. Soni, and A. D. Brown. Towards a High Performance System for Heterogeneous Big Spatial Data. CyberGIS AHM.