BioInterchange 2.0 CODAMONO. Integrating and Scaling Genomics Data

Similar documents
Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect

Certified Apache CouchDB Professional VS-1045

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Open Source Technologies on Microsoft Azure

BIG DATA TOOLS. Top 10 open source technologies for Big Data

Microsoft Azure: Opção de Nuvem para Todo o Desenvolvedor. Danilo Bordini & Osvaldo Daibert

Cloud Scale Distributed Data Storage. Jürmo Mehine

How To Handle Big Data With A Data Scientist

Sentimental Analysis using Hadoop Phase 2: Week 2

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

Big Data Architectures. Tom Cahill, Vice President Worldwide Channels, Jaspersoft

Linux A first-class citizen in Windows Azure. Bruno Terkaly bterkaly@microsoft.com Principal Software Engineer Mobile/Cloud/Startup/Enterprise

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Comparison of the Frontier Distributed Database Caching System with NoSQL Databases

Challenges for Data Driven Systems

An Approach to Implement Map Reduce with NoSQL Databases

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Applications for Big Data Analytics

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

RED HAT SOFTWARE COLLECTIONS BRIDGING DEVELOPMENT AGILITY AND PRODUCTION STABILITY

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Why Spark on Hadoop Matters

Oracle Big Data Handbook

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research &

What s Cooking in KNIME

Big Data on Microsoft Platform

Big Data Analytics - Accelerated. stream-horizon.com

Big Data Too Big To Ignore

Big Data and Data Science: Behind the Buzz Words

Assignment # 1 (Cloud Computing Security)

TRAINING PROGRAM ON BIGDATA/HADOOP

Data Analytics Infrastructure

Azure Data Lake Analytics

Hadoop on Windows Azure: Hive vs. JavaScript for Processing Big Data

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

Understanding NoSQL Technologies on Windows Azure

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Data Discovery and Systems Diagnostics with the ELK stack. Rittman Mead - BI Forum 2015, Brighton. Robin Moffatt, Principal Consultant Rittman Mead

WA2192 Introduction to Big Data and NoSQL EVALUATION ONLY

Domain driven design, NoSQL and multi-model databases

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

Syllabus INFO-GB Design and Development of Web and Mobile Applications (Especially for Start Ups)

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Database Management System Choices. Introduction To Database Systems CSE 373 Spring 2013

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

Gerald Kaszuba. Slowchop Studios Director Specialising in Game Design, Architecture, and Development.

Moving From Hadoop to Spark

Big Data. Facebook Wall Data using Graph API. Presented by: Prashant Patel Jaykrushna Patel

OpenCB a next generation big data analytics and visualisation platform for the Omics revolution

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

Architecting Open source solutions on Azure. Nicholas Dritsas Senior Director, Microsoft Singapore

Replicating to everything

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Oracle Big Data Management System

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

TRANSFORM BIG DATA INTO ACTIONABLE INFORMATION

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University

Alejandro Vaisman Esteban Zimanyi. Data. Warehouse. Systems. Design and Implementation. ^ Springer

Big Data Training - Hackveda

An Open Source NoSQL solution for Internet Access Logs Analysis

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam

Microsoft Azure Data Technologies: An Overview

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS

Architecting Applications to Scale in the Cloud

Tungsten Replicator, more open than ever!

How To Scale Out Of A Nosql Database

More Data in Less Time

Syllabus INFO-UB Design and Development of Web and Mobile Applications (Especially for Start Ups)

the missing log collector Treasure Data, Inc. Muga Nishizawa

Building Your Big Data Team

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

Integrating Big Data into the Computing Curricula

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look

Big Data Analytics: Where is it Going and How Can it Be Taught at the Undergraduate Level?

Open source software for building a private cloud

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Big Data and Market Surveillance. April 28, 2014

Safe Harbor Statement

multiparadigm programming Multiparadigm Data Storage for Enterprise Applications

Déployer son propre cloud avec OpenStack. GULL François Deppierraz

Creating Big Data Applications with Spring XD

MySQL Strategy. Morten Andersen, MySQL Enterprise Sales. Copyright 2014 Oracle and/or its affiliates. All rights reserved.

INTRODUCING APACHE IGNITE An Apache Incubator Project

Transcription:

BioInterchange 2.0 Integrating and Scaling Genomics Data by CODAMONO www.codamono.com @CODAMONO +1 647 780 3927 5-121 Marion Street, Toronto, Ontario, M6R 1E6, Canada

Genomics Today: Big Data Challenge Genomic data genes, transcription factor binding sites, regulatory elements, proteins are available in abundance. World-wide data centers provide free access to genomic features and variations via FTP servers. The most commonly shared genomic data files are: GFF3, GVF and VCF. Genomic data standards have one common denominator: they are incompatible to each other. Another problem are complementary data representation choices across GFF3, GVF and VCF standards, which make it difficult to access data from multiple data providers. BioInterchange 2. 0 tackles genomics standards by overcoming differences in their specifications, by providing first-class data-access methods, and by enabling the integration of genomic data into modern database management systems. Compatibility MapReduce Shards Linked Data SciPy/NumPy SQL NoSQL Triple Stores Python API BioInterchange 2.0 Genomic Data Files

Unified Data Model Genomic file standards emerged as a data sharing solution, where large genome centers would offer GFF3, GVF and VCF files for download via FTP. Even today, data sharing via file transfer is the most prevalent method of obtaining genomics data. File are not a data model though: formatting, encoding and representation of data items differ between GFF3, GVF and VCF files. It takes tremendous efforts to integrate genomics data into a data model that is suitable for data analysis and data mining. BioInterchange 2.0 solves the data integration problem by providing a single JSON/JSON-LD data model for transparent data access and data processing. This makes it possible to work with genomics data independently of the originating file format. Of course, BioInterchange 2.0 also provides means to convert data in JSON/JSON-LD back into the original genomic file formats.

High-Performance API Genomics data is inherently big data. Efficient algorithms are required to produce results within acceptable time frames. BioInterchange 2.0 comes with a Python API for first-class data processing. Whether it be for data filtering, data analysis, or data annotation, BioInterchange 2.0 s high-performance API has a data throughput that is twice as fast as BioPython or BioRuby. Data Integration JSON has become a lingua franca not only in the web- and cloud-development world, but it is also ubiquitously supported by modern database management systems (DBMS). NoSQL DBMS such as MongoDB, RethinkDB, CouchDB, and ArangoDB use JSON at their core. Established SQL DBMS like MySQL and PostgreSQL provide native support for JSON now too. Search servers, for example Elasticsearch, or data warehouse centric systems build on top of Hadoop, such as Apache Hive and Apache Pig, handle JSON as well. BioInterchange 2.0 s discrete use of JSON-LD annotations enables data transformations into data formats of the Resource Description Framework (RDF) where necessary. This makes it possible to integrate genomics data into triple stores as well as other graph databases.

Information Sheet Requirements Mac OS X Yosemite (OS 10.10) or newer; or Linux, amd64 architecture Installation Packages CloudBioLinux (Linux) Homebrew Science (Mac OS X and Linux) DMG file (Mac OS X) Debian (jessie) package (Linux) Docker images on Docker Hub (Linux) Supported Data Formats Generic Feature Format Version 3 (GFF3) Genome Variation Format (GVF) Variant Call Format (VCF) JavaScript Object Notation (JSON) JavaScript Object Notation for Linked Data (JSON-LD) Software Interfaces Command line interface Python 3.4 Application Programming Interface (API) End-User License Agreements This is an overview only; complete Terms & Conditions are available online. Free Tier o Trial License: 30 days, non-renewable o Peer-Review License: 30 days, renewable Monthly Subscriptions o Individual License: personal use, non-transferable (United States/Canada only) o Team License: up to 5 simultaneous colocated users o Cloud, Cluster & HIPAA License: unlimited users, offline license verification