Data Formats and Databases. Linda Woodard Consultant Cornell CAC



Similar documents
SQL Server An Overview

Data-Intensive Science and Scientific Data Infrastructure

Database Management System Choices. Introduction To Database Systems CSE 373 Spring 2013

What is a database? COSC 304 Introduction to Database Systems. Database Introduction. Example Problem. Databases in the Real-World

Geodatabase Programming with SQL

sqlite driver manual

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

Writing MySQL Scripts With Python's DB-API Interface

1 File Processing Systems

MySQL Storage Engines

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Object Relational Database Mapping. Alex Boughton Spring 2011

Database System Architecture & System Catalog Instructor: Mourad Benchikh Text Books: Elmasri & Navathe Chap. 17 Silberschatz & Korth Chap.

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Using SQL Server Management Studio

Demystified CONTENTS Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals CHAPTER 2 Exploring Relational Database Components

Chapter 6: Physical Database Design and Performance. Database Development Process. Physical Design Process. Physical Database Design

Web Development Frameworks

EMBL-EBI. Database Replication - Distribution

Using Python, Django and MySQL in a Database Course

database abstraction layer database abstraction layers in PHP Lukas Smith BackendMedia

Chapter 13. Introduction to SQL Programming Techniques. Database Programming: Techniques and Issues. SQL Programming. Database applications

High-Volume Data Warehousing in Centerprise. Product Datasheet

Optimizing the Performance of the Oracle BI Applications using Oracle Datawarehousing Features and Oracle DAC

Storing Measurement Data

Karl Lum Partner, LabKey Software Evolution of Connectivity in LabKey Server

Big Systems, Big Data

5.5 Copyright 2011 Pearson Education, Inc. publishing as Prentice Hall. Figure 5-2

Comparing SQL and NOSQL databases

Chapter 9 Java and SQL. Wang Yang wyang@njnet.edu.cn

How To Create A Table In Sql (Ahem)

Foundations of Business Intelligence: Databases and Information Management

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

D61830GC30. MySQL for Developers. Summary. Introduction. Prerequisites. At Course completion After completing this course, students will be able to:

Optional custom API wrapper. C/C++ program. M program

How, What, and Where of Data Warehouses for MySQL

Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data

Real-time reporting at 10,000 inserts per second. Wesley Biggs CTO 25 October 2011 Percona Live

Chapter 6 FOUNDATIONS OF BUSINESS INTELLIGENCE: DATABASES AND INFORMATION MANAGEMENT Learning Objectives

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

Physical Design. Meeting the needs of the users is the gold standard against which we measure our success in creating a database.

Performance Tuning for the Teradata Database

FileMaker 11. ODBC and JDBC Guide

SQL Databases Course. by Applied Technology Research Center. This course provides training for MySQL, Oracle, SQL Server and PostgreSQL databases.

Complex Data and Object-Oriented. Databases

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

FileMaker 12. ODBC and JDBC Guide

n Assignment 4 n Due Thursday 2/19 n Business paper draft n Due Tuesday 2/24 n Database Assignment 2 posted n Due Thursday 2/26

Exploring Microsoft Office Access Chapter 2: Relational Databases and Multi-Table Queries

Raima Database Manager Version 14.0 In-memory Database Engine

Intro to Databases. ACM Webmonkeys 2011

CitusDB Architecture for Real-Time Big Data

GIS Databases With focused on ArcSDE

Object Oriented Database Management System for Decision Support System.

CS2Bh: Current Technologies. Introduction to XML and Relational Databases. Introduction to Databases. Why databases? Why not use XML?

Parallel I/O on JUQUEEN

Contents. 2. cttctx Performance Test Utility Server Side Plug-In Index All Rights Reserved.

CS 377 Database Systems SQL Programming. Li Xiong Department of Mathematics and Computer Science Emory University

Oracle Database 10g Express

Package sjdbc. R topics documented: February 20, 2015

How to Design and Create Your Own Custom Ext Rep

White paper FUJITSU Software Enterprise Postgres

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Week 1 Part 1: An Introduction to Database Systems. Databases and DBMSs. Why Use a DBMS? Why Study Databases??

Programming Languages & Tools

Course MIS. Foundations of Business Intelligence

Enhancing Access to Climate Model Metadata via a Web-Accessible Database. Alisha R. Fernandez

Bringing Big Data Modelling into the Hands of Domain Experts

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Software: Systems and Application Software

MS Access Lab 2. Topic: Tables

Conventional Files versus the Database. Files versus Database. Pros and Cons of Conventional Files. Pros and Cons of Databases. Fields (continued)

Structure? Integrated Climate Data Center How to use the ICDC? Tools? Data Formats? User

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Oracle Warehouse Builder 10g

Object Oriented Databases. OOAD Fall 2012 Arjun Gopalakrishna Bhavya Udayashankar

Database Scalability and Oracle 12c

BarTender Integration Methods. Integrating BarTender s Printing and Design Functionality with Your Custom Application WHITE PAPER

FileMaker 13. ODBC and JDBC Guide

New Features in MySQL 5.0, 5.1, and Beyond

Foundations of Business Intelligence: Databases and Information Management

CS346: Database Programming.

Databases in Engineering / Lab-1 (MS-Access/SQL)

Services. Relational. Databases & JDBC. Today. Relational. Databases SQL JDBC. Next Time. Services. Relational. Databases & JDBC. Today.

<no narration for this slide>

DATABASE MANAGEMENT SYSTEM

Your Best Next Business Solution Big Data In R 24/3/2010

Basic Unix/Linux 1. Software Testing Interview Prep

G563 Quantitative Paleontology. SQL databases. An introduction. Department of Geological Sciences Indiana University. (c) 2012, P.

Globus Striped GridFTP Framework and Server. Raj Kettimuthu, ANL and U. Chicago

TIM 50 - Business Information Systems

What is Data Virtualization?

Hadoop and Map-Reduce. Swati Gore

Internals of Hadoop Application Framework and Distributed File System

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

SQL Server Database Coding Standards and Guidelines

Transcription:

Data Formats and Databases Linda Woodard Consultant Cornell CAC Workshop: Data Analysis on Ranger, January 19, 2012

How will you store your data? Binary data is compact but not portable Machine readable only Byte-order issues: big endian (IBM) vs. little endian (Intel) Formatted text is portable but not compact Need to know all the details of formatting to read the data 1 byte of ASCII text stores only a single decimal digit (~3 bits) Compression can help, but is slow and often impractical for large files Need to consider how data will be used Is portability an issue? Will your favorite analysis tools be able to read the data? Are there storage constraints? 1/19/2012 www.cac.cornell.edu 2

Data Preservation and Discovery NSF requires a data management plan with all grant proposals Metadata Formats used Data location Discovery and access plans https://confluence.cornell.edu/display/rdmsgweb/home Large Research Projects Personnel Long time horizons Distant collaborators Scientific data formats address some of these issues 1/19/2012 www.cac.cornell.edu 3

Hierarchical Scientific Data Formats Data Format Academic Discipline Parallel I/O Software Interfaces Comments HDF5 2D and higher dimensional data yes C, C++, Fortran, Java, Python, Perl, IDL, Matlab, Mathematica NetCDF Earth Sciences yes C, C++, Fortran, Java, Python, Perl, Ruby, IDL, R, Matlab, ArcGIS developed at NCSA developed at UCAR FITS Astrophysics no C, C++, Fortran, Java, Python, Perl, IDL, R, Matlab, Mathematica developed at NASA Silo General Visualization yes VisIt developed at LLNL 1/19/2012 www.cac.cornell.edu 4

Scientific Data Formats: HDF5 Versatile data model that can represent complex data objects and metadata Portable file format with no limit on the number or size of data objects Open software library that runs on platforms from laptops to massively parallel systems Integrated performance features that optimize access time and storage space Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection Source: www.hdfgroup.org/hdf5 1/19/2012 www.cac.cornell.edu 5

HDF5 Features Headers include extensive metadata (datatypes, dimensionality, storage layout); files are self-documenting Virtual file layer provides flexible storage and transfer capabilities: Standard (Posix), Parallel, and Network I/O file drivers Compression & chunking increase access and storage efficiency Datatype transformations can be performed during I/O operations Subsetting reduces transferred data volume & improves access speed during I/O operations Source: www.hdfgroup.org/hdf5 1/19/2012 www.cac.cornell.edu 6

Scientific Data Formats: netcdf Similar to HDF5; newest version uses the HDF5 format Used extensively in the Earth Sciences community for time varying geospatial data; most data from NOAA is in netcdf format NetCDF has good tools for geo-gridded data Panoply (http://www.giss.nasa.gov/tools/panoply/) focuses on the presentation of geo-gridded data. Ferret (http://ferret.wrc.noaa.gov/ferret/) offers a Mathematica-like approach to analysis. Variables and expressions may be defined interactively; calculations may be applied over arbitrarily shaped regions; geophysical formatting is built in. Parallel-NetCDF (http://trac.mcs.anl.gov/projects/parallel-netcdf/) is built upon MPI-IO to distribute file reads and writes efficiently among multiple processors. 1/19/2012 www.cac.cornell.edu 7

netcdf Features Self-Describing files include information about the data they contain Portable endian problems handled automatically Direct-access subsets of a larger dataset can be accessed without reading through all the preceding data Appendable data may be appended to a properly structured netcdf file without copying the dataset or redefining its structure Shareable one writer and multiple readers may simultaneously access the same netcdf file Archivable netcdf will always be backwards compatible Source: http://www.unidata.ucar.edu/software/netcdf 1/19/2012 www.cac.cornell.edu 8

Scientific Data Formats: Silo Silo is a library for reading and writing scientific data to binary disk files Silo supports point meshes, structured and unstructured meshes in 2D and 3D Two layers API with Fortran, C, and Python interfaces I/O driver (HDF5 is one of these drivers) Primary file format for VisIt https://wci.llnl.gov/codes/silo/index.html 1/19/2012 www.cac.cornell.edu 9

Why Parallel I/O is important P0 P1 P2 P3 P0 may become bottleneck System memory may be exceeded on P0 I/O lib File system 1/19/2012 www.cac.cornell.edu 10

Why Parallel I/O is important part 2 P0 P1 P2 P3 Possible to achieve good performance I/O lib I/O lib I/O lib I/O lib May require post-processing More work for applications, programmers File system 1/19/2012 www.cac.cornell.edu 11

Why Parallel I/O is important part 3 P0 P1 P2 P3 HDF5, netcdf and Silo can take the place of a parallel I/O library - they ve linked the parallel I/O library for you Parallel I/O lib (based on MPI-IO) Parallel file system Variant: only P1 and P2 act as parallel writers; they gather data from P0 and P3 respectively (chunking) 1/19/2012 www.cac.cornell.edu 12

Scientific Data Formats: Scientific Databases What is a database? from Wikipedia an organized collection of data When to use a database Data hierarchy more complicated than space/time dimensions Database added value Built-in data integrity checks Management of row duplication Enforcement of data ranges and types Enforces planning about the data to be stored Data types (integer, decimal, datetime), scale, precision Missing data (null values) Scalability 1/19/2012 www.cac.cornell.edu 13

Scientific Databases vs. Hierarchical Data Formats Academic Disciplines All with any kind of hierarchical data Parallel I/O Available from many commercial vendors and open sources Software interfaces to SQL databases C, C++, C#, Fortran, Java, Python, Perl, Ruby, IDL, R, Matlab, ArcGIS, Excel Advanced query capabilities Fine grained ability to extract subsets of the data efficiently 1/19/2012 www.cac.cornell.edu 14

Relational Database Software Enterprise-class relational database systems Oracle Microsoft SQL Server MySQL PostgreSQL Small, light-weight relational database systems SQLite (C based) SmallSQL (Java based) Apache Derby (Java based) Gadfly (Python based) 1/19/2012 www.cac.cornell.edu 15

Real life example A Facebook application that allows people to show their arxiv.org papers on their Facebook profile page The application needs to store information about papers so that it can extract these papers based on queries about authors Conceptually we have a couple of objects we want to connect: Authors (Facebook ID, arxiv info, etc.) Papers (title, abstract, journal reference, etc.) Two approaches to this problem 1/19/2012 www.cac.cornell.edu 16

Approach one Delimited flat file Add a row to the table for every unique paper an author has written Search the table for all rows that have the appropriate ID Benefits: Easy to add new entries (depending on sorting); harder with more entries Simple to code the read/write functions Problems: Facebook ID Paper ID Paper Title Paper Authors 2341234 http://arxiv.org/abs/5234 Particle Pleasantry CH Foo 2341234 http://arxiv.org/abs/3234 Reading is neat CH Foo, RG Fields 1234123 http://arxiv.org/abs/4321 Science in Teaching DS Henry, RG Fields 1234123 http://arxiv.org/abs/3234 Reading is neat CH Foo, RG Fields 12341345 http://arxiv.org/abs/4321 Science in Teaching DS Henry, RG Fields A row is duplicated for every author of a paper (e.g., Reading is neat) To match a given Facebook ID, a linear scan of the entire file is required 1/19/2012 www.cac.cornell.edu 17

Approach two Relational database Define two tables Users/Authors table and Papers table Link them together by the paperid PaperID in Papers is a Primary Key; it uniquely identifies a row PaperID in Users is a Foreign Key; it points to a row in another table Benefits Fast retrevial of papers for an author Easy to add fields to the Users and Papers tables Problems Database management 1/19/2012 www.cac.cornell.edu 18

More relationships We can create relationships that are much more fined-grained Make the Users table more general Create a separate Authors table and a Reviewers table for reviews 1/19/2012 www.cac.cornell.edu 19

Relational databases Relational databases are based on the relational model--data is expressed by a set of binary relationships Flat files would replicate columns of metadata for each row The replication gets worse when the metadata is hierarchical 1/19/2012 www.cac.cornell.edu 20

Flat file or database? Flat files are useful for Small datasets Static dumping of data Databases are useful for Evolving data Data where searching/querying is important/complex Expressing relationships that are not captured in a row-based table Other factors to consider: Database overhead Expectations about sharing data 1/19/2012 www.cac.cornell.edu 21

Interacting with a database SQL SQL Structured Query Language A programming language designed for the creation, management, modification and retrieval of data from a database All databases speak SQL, though many also provide non-standard extensions Using a database requires a basic knowledge of SQL Designing a database requires extensive knowledge of SQL PL/SQL and SQL/PSM Database extensions for creating stored procedures 1/19/2012 www.cac.cornell.edu 22

SQL language Select, Where Select & Where control what subset of data to obtain from the database Retrieve sensor locations SELECT XCoordinate, YCoordinate FROM SensorLocation Retrieve sensor data for a one sensor SELECT * FROM SensorData WHERE SensorID = 200 Retrieve sensor data for a one time period SELECT * FROM SensorData WHERE ReadTime = 1/1/2012 1/19/2012 www.cac.cornell.edu 23

SQL language Join Use Join to retrieve data from more than one table Retrieve sensor data with locations SELECT SensorID, ReadTime, Measurement, Xcoordinate, YCoordinate FROM SensorData SD INNER JOIN SensorLocation SL ON SD.SensorLocationID = SL.SensorLocationID Retrieve sensor data with sensor types SELECT SD.SensorID, ReadTime, Measurement, SensorValue FROM SensorData SD INNER JOIN Sensors S ON SD.SensorID = S.SensorsID INNER JOIN SensorType ST ON S.SensorTypeID = ST.SensorTypeID 1/19/2012 www.cac.cornell.edu 24

SQL language insert, update, commit New rows can be inserted and existing rows can be updated Insert a new sensor INSERT INTO SensorType (SensorTypeID, SensorTypeDescription, SensorValue) Values(1001, Heat Sensor,20) INSERT INTO Sensors Values(89019, 1001, Cayuga ) Update the sensor data UPDATE Sensors SET SensorName = Cayuga Lake WHERE SensorID=89019 Commit the transactions COMMIT 1/19/2012 www.cac.cornell.edu 25

SQL language delete, order, etc. Dropping rows from a table mirrors SELECTing Delete data for a particular day DELETE FROM SensorData WHERE ReadTime = 1/1/2011 No assumptions can be made about the order that rows are retrieved Sort rows by location and within location by time SELECT * FROM SensorData ORDER BY SensorLocationID, ReadTime 1/19/2012 www.cac.cornell.edu 26

Using SQL in application code Most programming languages have SQL interfaces which are software modules that provide a connection to the database and a cursor import MySQLdb conn = MySQLdb.connect(host= h, user= u, passwd= p321, db= test ) cursor = conn.cursor() cursor.execute( SELECT * FROM SensorData WHERE SensorID = 1234 ) row = cursor.fetchone() cursor.execute ( SELECT * FROM SensorLocations ) row = cursor.fetchall() cursor.close() conn.close() Note: Many interfaces have a special executequery function which returns an iterable to retrieve rows (res->next()) 1/19/2012 www.cac.cornell.edu 27

SQL language assessment Benefits: SQL is a relatively simple language Interfaces exist from many programming languages to every type of database; all reasonable databases support SQL; therefore SQL is a ubiquitous choice Lines of code can be drastically reduced by taking advantage of powerful SQL commands for searching and retrieving objects from a database Problems: SQL queries can be amazingly inefficient ; there are tools for optimization Another language to learn 1/19/2012 www.cac.cornell.edu 28

Object-relational mapping Object-relational mapping (also ORM and O/R mapping) converts data between a database and an object-oriented programming language An ORM tool lets you create and use a database within a standard OO programming paradigm Database tables are created from class definitions SQL queries are basically written for you by the tool The ORM tools also allow you direct SQL access where optimized queries are needed 1/19/2012 www.cac.cornell.edu 29

OR mapping Script to create database tables Begin; CREATE TABLE SensorType ( SensorTypeID integer NOT NULL PRIMARY KEY, SensorTypeDescription varchar(100) NOT NULL, SensorValue integer NOT NULL); CREATE TABLE Sensors ( SensorID integer NOT NULL PRIMARY KEY, SensorTypeID integer NOT NULL, SensorName varchar(100) NOT NULL); CREATE TABLE SensorLocation ( SensorLocationID integer NOT NULL PRIMARY KEY, XCoordinate integer NOT NULL, YCoordinate integer NOT NULL, Online integer NOT NULL;) CREATE TABLE SensorData ( SensorID integer NOT NULL, SensorLocationID integer NOT NULL, RealTime datetime NOT NULL, Measurement integer NOT NULL;) ALTER TABLE SensorData ADD CONSTRAINT SensorID_refs_id FOREIGN KEY ( SensorID REFERENCES Sensors ( SensorID ); COMMIT 1/19/2012 www.cac.cornell.edu 30

OR mapping data structure The structure of the data is specified using classes and member variables Null-ability, default values, primary and foreign keys are easily specified Specify the primary key (otherwise one is generated) This means varchar(100) class SensorType (Model): SensorTypeID = models.integerfield (primary_key =True) SensorTypeDescription = models.charfield(max_length=100, null =False) SensorValue = models.integerfield (null=false) 1/19/2012 www.cac.cornell.edu 31

OR mapping programming The notation for dealing with an OR-mapped version is relatively simple but has several important features Transactions/sessions are managed by the mapper Type checking is enforced by the language rather than at runtime by SQL Changing data tables means changing the class structure 1/19/2012 www.cac.cornell.edu 32

Summary databases Databases can be an effective way to improve your ability to share and manage your data. Databases and database technologies are increasingly embedded in a variety of systems and technology stacks to support easy use of these systems are increasingly omnipresent. Database languages and tools can help reduce the amount of code you manage in your projects. 1/19/2012 www.cac.cornell.edu 33