SciDB: an open-source, array-oriented database management system
Transcription
1 SciDB: an open-source, array-oriented database management system. Philippe Cudré-Mauroux (Massachusetts Institute of Technology / University of Fribourg, Switzerland) & the SciDB Team. October 29, 2010, CIS, Itaipava, Brazil.
2 Outline. I. Context: Exascale Data Deluge; XLDB Workshops; The SciDB Consortium. II. System Overview: Data Model; Storage Model; Operators; Optimal Chunking; Array Versioning; SS-DB: a Science DBMS benchmark.
3 Exascale Data Deluge. Science: biology, astronomy, remote sensing; new data formats; peta- and exa-scale data sets. Web companies: eBay, Yahoo. Financial services, retail companies, governments, etc. (Wired 2009)
4 Obsolete Infrastructures. Data management infrastructures (DBMSs) were not prepared for this deluge: one application, one data type, one user, one CPU (IBM). Infrastructures collapse.
5 Why is evolving DBMSs so hard? Two fundamental problems to tackle. (1) Obsolete physical model (impedance mismatch): N-ary storage, relational tuples; B/R-trees; SPJ queries; catastrophic performance in new application domains. (2) Impractical logical guarantees: ACID properties; Brewer's conjecture (CAP theorem); inherent difficulty to scale out.
6 New Era of Data Management. Users have moved away from DBMSs (file-centric solutions, procedural processing chains) and have come back (data abstractions, both physical & logical; strong consistency guarantees).
7 XLDB Workshops. Extremely Large Databases Workshops: practical issues related to extremely large databases (1 PB+); focus on eScience (radio astronomy, geoscience, genomics, particle physics); invitation-only; organized since 2007 by Becla & Lim (SLAC). Database researchers were invited from the start, and a consensus emerged on the array data model. SciDB was born: an open-source, array-oriented database system; CIDR paper 2009 [S09].
8 The SciDB Consortium. Distributed R&D team; Scientific Advisory Board; start-up (Zetics): Marilyn Matz (CEO) & Paul Brown (CTO). Beta version demoed at VLDB 2009 [CM09]. First public release (V0.5) available now.
9 II. System Overview
10 SciDB Philosophy: array-oriented, declarative, extensible, shared-nothing, open-source.
11 High-Level Architecture. [Figure labels: SciDB Application; Application Layer: Language Specific UI; Runtime Supervisor; Server Layer: Interface and Parser; three Node boxes; Storage Layer: Local Executor, Storage Manager.]
12 Data Model. Nested, multidimensional arrays as first-class citizens; basic data types; user-defined types. [Figure: a two-dimensional array (Dimension 1 x Dimension 2, origin (0,0)) whose cells hold the attributes a1 = 2, a2 = 3.2f, a3 = (5,2).]
13 Storage Model (1/3). Vertical partitioning. [Figure: the array of slide 12 is split into one array per attribute (a1, a2, a3, ...), each over the same Dimension 1 x Dimension 2 grid with cells (0,0), (0,1), (0,2), (0,3), ...]
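The vertical partitioning on this slide can be illustrated with a minimal sketch. It assumes each cell is a small record with attributes a1, a2 and a3 (names taken from the slide); the function name `vertically_partition` and the in-memory list layout are hypothetical and say nothing about SciDB's actual on-disk format.

```python
from typing import Any, Dict, List

Cell = Dict[str, Any]

def vertically_partition(grid: List[List[Cell]], attributes: List[str]) -> Dict[str, List[List[Any]]]:
    """Split a grid of multi-attribute cells into one grid per attribute."""
    return {
        name: [[cell[name] for cell in row] for row in grid]
        for name in attributes
    }

# Hypothetical 4x4 array; cell (0,0) carries the values shown on the slide.
grid = [[{"a1": 0, "a2": 0.0, "a3": (0, 0)} for _ in range(4)] for _ in range(4)]
grid[0][0] = {"a1": 2, "a2": 3.2, "a3": (5, 2)}

columns = vertically_partition(grid, ["a1", "a2", "a3"])
print(columns["a1"][0][0], columns["a3"][0][0])
```

A query that touches only a1 would then read `columns["a1"]` and skip the other attributes entirely, which is the usual motivation for this layout.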
14 Storage Model (2/3). Co-location of values: adaptive chunking; dense packing; chunk co-location; compression. Two on-disk representations: dense arrays (extremely compact: no positions stored, cells located by offsetting) and sparse arrays (compactly store positions + values in array order).
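The difference between the two representations can be sketched as follows. This is only an illustration under simple assumptions (row-major order, `None` marking an empty cell); the helper names are hypothetical and the real encodings are not described on the slide.

```python
import bisect
from typing import List, Optional, Tuple

def encode_dense(grid: List[List[float]]) -> Tuple[int, List[float]]:
    """Store values only, in row-major order; positions are implied by offsets."""
    n_cols = len(grid[0])
    return n_cols, [value for row in grid for value in row]

def read_dense(encoded: Tuple[int, List[float]], i: int, j: int) -> float:
    n_cols, values = encoded
    return values[i * n_cols + j]  # offsetting: no stored position needed

def encode_sparse(grid: List[List[Optional[float]]]) -> List[Tuple[Tuple[int, int], float]]:
    """Store (position, value) pairs for non-empty cells, in row-major order."""
    return [((i, j), value)
            for i, row in enumerate(grid)
            for j, value in enumerate(row)
            if value is not None]

def read_sparse(encoded: List[Tuple[Tuple[int, int], float]], i: int, j: int) -> Optional[float]:
    positions = [pos for pos, _ in encoded]
    k = bisect.bisect_left(positions, (i, j))  # binary search works because pairs are in array order
    if k < len(encoded) and positions[k] == (i, j):
        return encoded[k][1]
    return None

dense = encode_dense([[1.0, 2.0], [3.0, 4.0]])
sparse = encode_sparse([[None, 2.0], [None, None], [5.0, None]])
print(read_dense(dense, 1, 0), read_sparse(sparse, 2, 0), read_sparse(sparse, 1, 1))
```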
15 Storage Model (3/3). Co-location of nested arrays: a nested array of rank M from a parent array of rank N is stored as an array of rank (M+N). Redundancy: array replication; chunking w/ overlap. Parallelism. [Figure: chunk 1 and chunk 2 with their overlap regions, overlap 1 and overlap 2.]
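Chunking with overlap can be sketched along one dimension. The sketch assumes that each chunk stores its own core range plus `overlap` extra cells on either side, clipped to the array bounds; the function name `chunk_bounds` is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]

def chunk_bounds(lo: int, hi: int, chunk_size: int, overlap: int) -> List[Tuple[Interval, Interval]]:
    """Return (core, stored) inclusive intervals for each chunk of one dimension."""
    chunks = []
    start = lo
    while start <= hi:
        end = min(start + chunk_size - 1, hi)
        # The stored interval extends the core by `overlap` cells, clipped to [lo, hi].
        stored = (max(lo, start - overlap), min(hi, end + overlap))
        chunks.append(((start, end), stored))
        start = end + 1
    return chunks

bounds = chunk_bounds(0, 99999, 1000, 10)
print(len(bounds), bounds[0], bounds[1])
```

With this layout, an operation that needs a neighbourhood of at most `overlap` cells around any core cell can run on a single chunk without fetching its neighbours, at the price of storing the border cells twice.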
16 Operators (1/2). Declarative system supporting an array-oriented query language (AQL). Three types of operators: structural (subsample, add dimension, concatenate, etc.); content-dependent (filter, select, etc.); both structural & content-dependent (joins).
17 Operators (2/2). Scatter/gather: generic operator to distribute & aggregate array chunks & values in distributed settings.
18 The Array Query Language (AQL)
CREATE ARRAY Test_Array
  < A: integer NULLS,
    B: double,
    C: USER_DEFINED_TYPE >
  [ I=0:99999,1000,10, J=0:99999,1000,10 ]
  PARTITION OVER ( Node1, Node2, Node3 )
  USING block_cyclic();
Annotations: attribute names A, B, C; index names I, J; chunk size 1000; overlap 10.
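To make the declaration concrete, the sketch below maps a cell (I, J) of Test_Array to its chunk and then to a node. The slide does not define `block_cyclic()`, so the round-robin assignment over chunks in row-major order is an assumption made for illustration, as are the helper names `chunk_of` and `node_of`.

```python
from typing import Tuple

NODES = ["Node1", "Node2", "Node3"]   # from PARTITION OVER (...)
DIM_LENGTH = 100000                   # I and J both range over 0:99999
CHUNK_SIZE = 1000                     # chunk size from the dimension spec

def chunk_of(i: int, j: int) -> Tuple[int, int]:
    """Chunk coordinates of the chunk whose core contains cell (i, j)."""
    return i // CHUNK_SIZE, j // CHUNK_SIZE

def node_of(i: int, j: int) -> str:
    """Assumed block-cyclic placement: chunks dealt round-robin in row-major order."""
    ci, cj = chunk_of(i, j)
    chunks_per_row = DIM_LENGTH // CHUNK_SIZE
    linear_index = ci * chunks_per_row + cj
    return NODES[linear_index % len(NODES)]

print(chunk_of(1234, 99999), node_of(0, 0), node_of(0, 1000), node_of(1000, 0))
```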
19 The Array Query Language (AQL)
SELECT Geo-Mean ( T.B )
FROM Test_Array T
WHERE
  T.I BETWEEN :C1 AND :C2
  AND T.J BETWEEN :C3 AND :C4
  AND T.A = 10
GROUP BY T.I;
Annotations: user-defined aggregate on an attribute B in array T; subsample; filter; group-by. So far as SELECT / FROM / WHERE / GROUP BY queries are concerned, there is little logical difference between AQL and SQL.
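The semantics of this query can be sketched in a few lines. The sketch assumes the array is held as a dictionary from (I, J) to (A, B), that B is strictly positive so the geometric mean is defined, and that BETWEEN is inclusive as in SQL; `geo_mean` and `run_query` are hypothetical names, not SciDB functions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

def geo_mean(values: List[float]) -> float:
    """Geometric mean of strictly positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def run_query(cells: Dict[Tuple[int, int], Tuple[int, float]],
              c1: int, c2: int, c3: int, c4: int) -> Dict[int, float]:
    groups = defaultdict(list)
    for (i, j), (a, b) in cells.items():
        in_window = c1 <= i <= c2 and c3 <= j <= c4   # subsample on the dimensions
        if in_window and a == 10:                     # filter on attribute A
            groups[i].append(b)                       # group by dimension I
    return {i: geo_mean(bs) for i, bs in sorted(groups.items())}

cells = {
    (0, 0): (10, 2.0), (0, 1): (10, 8.0), (0, 2): (5, 100.0),
    (1, 0): (10, 3.0), (5, 0): (10, 9.0),
}
result = run_query(cells, 0, 1, 0, 2)
print(result)
```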
20 User-Defined Functions (UDFs). Extensibility through Postgres-style UDFs: encode all domain-specific operations; coded in C++. Logical definition: 1-N arrays in, 1-N arrays out. Physical implementation at chunk granularity: receives chunks from the local query executor; creates cell iterators and consumes values; returns chunks to the local executor.
21 Parallel UDFs. Regrid 2:1 UDF. [Figure: Worker 1 and Worker 2, each with an input pipeline and an output pipeline of chunks; chunk coordinates shown: (16,0), (8,0), (4,0), (2,0), (0,0), (12,0), (6,0).]
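A chunk-at-a-time regrid UDF running on two workers can be sketched as below. The slide does not say how the 2:1 regrid combines cells, so averaging each 2x2 block is an assumption; chunks are assumed to have even dimensions so no block straddles a chunk boundary, and Python threads stand in for the workers (the actual UDFs are coded in C++).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Chunk = List[List[float]]

def regrid_chunk(chunk: Chunk) -> Chunk:
    """Average each 2x2 block of one chunk (assumed regrid rule)."""
    out = []
    for r in range(0, len(chunk), 2):
        row = []
        for c in range(0, len(chunk[0]), 2):
            total = chunk[r][c] + chunk[r][c + 1] + chunk[r + 1][c] + chunk[r + 1][c + 1]
            row.append(total / 4)
        out.append(row)
    return out

def parallel_regrid(chunks: Dict[Tuple[int, int], Chunk], workers: int = 2) -> Dict[Tuple[int, int], Chunk]:
    """Regrid every chunk independently; chunk origins are halved in the output."""
    origins = sorted(chunks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(regrid_chunk, (chunks[o] for o in origins)))
    return {(x // 2, y // 2): res for (x, y), res in zip(origins, results)}

chunks = {(0, 0): [[1, 3], [5, 7]], (2, 0): [[2, 2], [2, 2]]}
regridded = parallel_regrid(chunks)
print(regridded)
```

Because each output chunk depends on exactly one input chunk, the workers need no coordination, which is what makes this operator easy to pipeline.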
22 Demo preview of V node installation
23 Optimal Chunking. How to chunk and index very large arrays? Chunk sizes & index size directly impact all array operations (most operations are highly IO-bound). Multifaceted problem: chunks must be large enough to amortize disk seeks, but small enough to fit in main memory and to avoid reading unnecessary data; data can be highly skewed; queries can be highly skewed; the index must be compact and easily updatable.
24 Spatial Partitioning (1/2). Cost-based, optimal spatial partitioning; efficient, hierarchical partitioning.
25 Spatial Partitioning (2/2). Basic idea: a cost model for query execution times based on the number of cells accessed; optimal quadtree construction based on the cost model, query workload, local density & page size: cellsize_opt(Q, D, pagesize) [CM10a] (TrajStore). Optimal balance between oversized cells (potentially retrieve data that is not queried) and undersized cells (seek not amortized if too little data is read; unnecessary seeks if the data is dense and the query relatively large).
26 Versioning. N-dimensional array objects; frequent insertions of new versions (new observations, new simulations); arbitrary queries over version history. [Figure: queries Q1 and Q2 over versions T1, T2, T3.]
27 Problems w/ Versioning Systems. Current versioning systems (e.g., SVN, Git) are slow for large / numerous binary files and catastrophic for spatial queries over series of versions: no spatial chunking; no co-location of versions. [Figure: versions T1, T2, T3.]
28 AVS Algorithms. Given a workload of frequent / typical queries: optimal co-location of (N+1)-dimensional chunks on disk; optimal materialization / Δ-compression of versions to minimize storage space and query execution time. [Figure: versions stored either materialized (M) or as deltas (Δ).] Orders of magnitude faster than SVN / Git on scientific workloads.
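The trade-off between materialized versions and deltas can be illustrated with a small sketch. It does not implement the workload-driven optimization named on the slide: it simply materializes every third version (a fixed, assumed policy), stores the others as cell-level deltas against their predecessor, and assumes cells are never deleted. `build_store` and `read_version` are hypothetical names.

```python
from typing import Dict, List, Tuple

Version = Dict[Tuple[int, int], float]
Entry = Tuple[str, Version]   # ("M", full copy) or ("D", changed cells only)

def build_store(versions: List[Version], materialize_every: int = 3) -> List[Entry]:
    store: List[Entry] = []
    for t, version in enumerate(versions):
        if t % materialize_every == 0:
            store.append(("M", dict(version)))          # materialized version
        else:
            previous = versions[t - 1]
            delta = {pos: val for pos, val in version.items()
                     if previous.get(pos) != val}       # cells that changed
            store.append(("D", delta))
    return store

def read_version(store: List[Entry], t: int) -> Version:
    base = t
    while store[base][0] != "M":                        # walk back to a materialized version
        base -= 1
    cells = dict(store[base][1])
    for k in range(base + 1, t + 1):                    # replay deltas forward
        cells.update(store[k][1])
    return cells

versions = [
    {(0, 0): 1, (0, 1): 2},
    {(0, 0): 1, (0, 1): 5},
    {(0, 0): 7, (0, 1): 5},
    {(0, 0): 7, (0, 1): 9},
]
store = build_store(versions)
print(store[1], read_version(store, 2))
```

Materializing more often shortens the delta chain a query must replay but costs storage; materializing less often does the opposite, which is the balance the slide's algorithms optimize for a given query workload.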
29 The SS-DB Benchmark. Three main ways of building a scientific DBMS: array simulation layer on top of a relational DBMS (MonetDB); array executor on top of blobs (Rasdaman); native array system (SciDB). SS-DB [CM10b]: suggest a realistic workload for eScience; queries on raw data, derived data and ensemble data; understand the performance differences between models.
30 Contents. Receive raw imagery data + metadata (2D images); load the dataset; cook the data to find observations (stars); group observations (follow stars over space & time); execute a few queries on raw data as well as derived data.
31 Query Workload. Q1: Aggregation; Q2: Recooking; Q3: Regridding; Q4: Aggregation; Q5: Polygons; Q6: Density. Imagery: extract user-specified part of image; detect observations that match user-specified criteria. Q7: Group Centroids; Q8 & Q9: Trajectories.
32 Current Results
33 SciDB Roadmap. R0.5 available, a very early stage work-in-progress: basic functionality; iquery interpreter; internal query language (AIL); small set of operators (create, aggregate, join, filter, subsample); minimal wiki-style documentation; breathes, crawls, hiccups. R0.75 targeted for end-of-year: AQL, the SQL-like array query language; error handling; scalable math operations; better documentation & stability. R1.0 beta in April 2011: more functionally complete (UDFs, uncertainty, provenance, et al.); robust, high performance.
34 Research Agenda for XI. Array Versioning: tests on very large data sets; forks; distributed versioning. Data Lineage (Semantic Information): arbitrary information stored in a triple-store (integration / performance?); automatic creation of workflow metadata; arbitrary AQL queries using metadata.
35 References
SciDB website: SSDB paper:
[S09] Requirements for Science Data Bases and SciDB. Stonebraker, Becla, DeWitt, Lim, Maier, Ratzesberger, Zdonik. CIDR 2009.
[CM09] A Demonstration of SciDB: A Science-Oriented DBMS. Cudré-Mauroux, Kimura, Lim, Rogers, Simakov, Soroush, Velikhov, Wang, Balazinska, Becla, DeWitt, Heath, Maier, Madden, Stonebraker, Zdonik. PVLDB 2009.
[CM10a] TrajStore: An Adaptive Storage System for Very Large Trajectory Data Sets. Cudré-Mauroux, Wu, Madden. ICDE 2010.
[CM10b] SS-DB: A Standard Science DBMS Benchmark. Cudré-Mauroux, Kimura, Lim, Rogers, Madden, Stonebraker, Zdonik. Submitted for publication.
