HathiTrust Research Center Architecture. Data subsystem

Size: px
Start display at page:

Download "HathiTrust Research Center Architecture. Data subsystem"

Transcription

1 HathiTrust Research Center Architecture Data subsystem

2 Web portal Desktop SEASR client CI logon (NCSA) Access control (e.g. Grouper) framework Non- consump?ve Data capsules Task deployment SEASR analy?cs service Meandre Orches- tra?on WSO2 registry - services, collec?ons, data capsule images Authorita?ve volume store (Cassandra) WSO2 Enterprise service bus Programma?c access (e.g., Bamboo) NCSA local resources University of Michigan Replicated volume stores Future Grid Penguin on Demand NCSA HPC resources Page/volume tree (file system) rsync HathiTrust corpus

3 Web portal Desktop SEASR client CI logon (NCSA) Access control (e.g. Grouper) framework Non- consump?ve Data capsules Task deployment SEASR analy?cs service Meandre Orches- tra?on WSO2 registry - services, collec?ons, data capsule images Authorita?ve volume store (Cassandra) WSO2 Enterprise service bus Programma?c access (e.g., Bamboo) NCSA local resources University of Michigan Replicated volume stores Future Grid Penguin on Demand NCSA HPC resources Page/volume tree (file system) rsync HathiTrust corpus

4 Web portal Desktop SEASR client CI logon (NCSA) Access control (e.g. Grouper) framework Non- consump?ve Data capsules Task deployment SEASR analy?cs service Meandre Orches- tra?on WSO2 registry - services, collec?ons, data capsule images Authorita?ve volume store (Cassandra) WSO2 Enterprise service bus Programma?c access (e.g., Bamboo) NCSA local resources University of Michigan Replicated volume stores Future Grid Penguin on Demand NCSA HPC resources Page/volume tree (file system) rsync HathiTrust corpus

5 Solr quick introduc?on Lucene is a high- performance, full- featured text search engine library Solr is a web service frontend to Lucene Index consists of documents and document consists of fields which are name/value pair

6 HTRC Solr Has both bibliographic informa?on and full- text OCR scan 29 fields volume ID,?tle, author, several reference IDs (ISBN, ISSN, callnumber, etc), and full text Basic search like term query, wildcard, fuzzy query, phrase query and range query: Example: OCR: war, search documents containing the word war in text Term Vector is enabled to get word frequency and offset for each word : Occurences posi?on and offset

7 Filtered Term Vector Default Term Vector is massive O(5MB) per volume Extremely slow response for mul?ple volumes We extended Solr to filter unwanted words to enhance response speed significantly. Reduced term vector to O(80KB) per volume.

8 Web portal Desktop SEASR client CI logon (NCSA) Access control (e.g. Grouper) framework Non- consump?ve Data capsules Task deployment SEASR analy?cs service Meandre Orches- tra?on WSO2 registry - services, collec?ons, data capsule images Authorita?ve volume store (Cassandra) WSO2 Enterprise service bus Programma?c access (e.g., Bamboo) NCSA local resources University of Michigan Replicated volume stores Future Grid Penguin on Demand NCSA HPC resources Page/volume tree (file system) rsync HathiTrust corpus

9 Ingest Procedure Use rsync to pull filesystem data from HT main collec?on. Too many small text files... Parse structural metadata (METS) ordering of page, page checksum (and verifica?on); some metadata stored to NoSQL. Analyze delta logs to push incremental changes to NoSQL store

10 Rsync root Split pairtree list Bib metdata Collec?on namespace 1 pairtree_root Collec?on namespace 2 pairtree_root HTRC Text Corpora Ingest Workflow pairtree pairtree HathiTrust (remote) HathiTrust Research Center (local) Rsync split pairtree list Parallel rsync of the rest using split tree list Push modified volume s from pairtree to nosql Update collec?ons list pairtree pairtree Delta logs Split pairtree list Bib metdata pairtree_root Collec?on namespace 1 pairtree_root Collec?on namespace 2 Cassandra nosql repository Rsync root

11 Web portal Desktop SEASR client CI logon (NCSA) Access control (e.g. Grouper) framework Non- consump?ve Data capsules Task deployment SEASR analy?cs service Meandre Orches- tra?on WSO2 registry - services, collec?ons, data capsule images Authorita?ve volume store (Cassandra) WSO2 Enterprise service bus Programma?c access (e.g., Bamboo) NCSA local resources University of Michigan Replicated volume stores Future Grid Penguin on Demand NCSA HPC resources Page/volume tree (file system) rsync HathiTrust corpus

12 NoSQL Repository U?lizing Cassandra as a storage space for our text collec?ons and related metadata Aggregates small texts Allows us to manage flexible schemas Key- value based column store Offers good scalability, redundancy, and performance

13 Cassandra Schema Key: (volume ID) metadata Inu /001 Inu /xxx Inu copyright public Page count 16 What s up doc? f Rabbits 7 aabbcc metadata Inu /001 Inu /xxx Inu copyright Page count In- copyright b!2b 6 7effdd A ques?on 10 deadbeef Each row represents a volume Row key is the volume ID Each row contains many columns First column contains metadata ahributes about the volume Each subsequent column family is a page, key is page ID Page- specific columns contain page s and metadata about the page

14 Cassandra Schema Key: (volume ID) metadata Inu /001 Inu /xxx Inu copyright public Page count 16 What s up doc? f Rabbits 7 aabbcc metadata Inu /001 Inu /xxx Inu copyright Page count In- copyright b!2b 6 7effdd A ques?on 10 deadbeef Pros Cons Works well for all access primi?ves Well organized metadata no repe??ons Volume level versioning could follow similar schema, but version number needs to be concatenated to volume ID for historical versions Subcolumn families cannot be indexed Extra metadata are picked up even when only page s are needed Must store historical versions of volumes as deltas; naïve transla?on of the above format to historical versioning would have high cost in space

HathiTrust Research Center Architecture Overview

HathiTrust Research Center Architecture Overview HathiTrust Research Center Architecture Overview Robert H. McDonald @mcdonald Execu've Commi-ee- HathiTrust Research Center (HTRC) Deputy Director- Data to Insight Center Associate Dean- University Libraries

More information

Search and Real-Time Analytics on Big Data

Search and Real-Time Analytics on Big Data Search and Real-Time Analytics on Big Data Sewook Wee, Ryan Tabora, Jason Rutherglen Accenture & Think Big Analytics Strata New York October, 2012 Big Data: data becomes your core asset. It realizes its

More information

MBooks: Google Books Online at the University of Michigan Library

MBooks: Google Books Online at the University of Michigan Library MBooks: Google Books Online at the University of Michigan Library Phil Farber, Chris Powell, Cory Snavely University of Michigan Library Information Technology Architecture overview Four basic pieces:

More information

System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks

System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks OnurSoft Onur Tolga Şehitoğlu November 10, 2012 v1.0 Contents 1 Introduction 3 1.1 Purpose..............................

More information

Hypertable Architecture Overview

Hypertable Architecture Overview WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for

More information

Using RDBMS, NoSQL or Hadoop?

Using RDBMS, NoSQL or Hadoop? Using RDBMS, NoSQL or Hadoop? DOAG Conference 2015 Jean- Pierre Dijcks Big Data Product Management Server Technologies Copyright 2014 Oracle and/or its affiliates. All rights reserved. Data Ingest 2 Ingest

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Performance Management in Big Data Applica6ons. Michael Kopp, Technology Strategist @mikopp

Performance Management in Big Data Applica6ons. Michael Kopp, Technology Strategist @mikopp Performance Management in Big Data Applica6ons Michael Kopp, Technology Strategist NoSQL: High Volume/Low Latency DBs Web Java Key Challenges 1) Even Distribu6on 2) Correct Schema and Access paperns 3)

More information

Proposal for Persistent & Unique Entity Identifiers

Proposal for Persistent & Unique Entity Identifiers Proposal for Persistent & Unique Entity Identifiers Jacob Jett, Guangchen Ruan, Leena Unnikrishnan, Colleen Fallaw, Chistopher Maden, Tim Cole UNIVERSITY OF ILLINOIS GSLIS INDIANA UNIVERISTY Executive

More information

Hunk & Elas=c MapReduce: Big Data Analy=cs on AWS

Hunk & Elas=c MapReduce: Big Data Analy=cs on AWS Copyright 2014 Splunk Inc. Hunk & Elas=c MapReduce: Big Data Analy=cs on AWS Dritan Bi=ncka BD Solu=ons Architecture Disclaimer During the course of this presenta=on, we may make forward looking statements

More information

Configuring Apache HTTP Server as a Reverse Proxy Server for SAS 9.2 Web Applications Deployed on BEA WebLogic Server 9.2

Configuring Apache HTTP Server as a Reverse Proxy Server for SAS 9.2 Web Applications Deployed on BEA WebLogic Server 9.2 Configuration Guide Configuring Apache HTTP Server as a Reverse Proxy Server for SAS 9.2 Web Applications Deployed on BEA WebLogic Server 9.2 This document describes how to configure Apache HTTP Server

More information

.nl ENTRADA. CENTR-tech 33. November 2015 Marco Davids, SIDN Labs. Klik om de s+jl te bewerken

.nl ENTRADA. CENTR-tech 33. November 2015 Marco Davids, SIDN Labs. Klik om de s+jl te bewerken Klik om de s+jl te bewerken Klik om de models+jlen te bewerken Tweede niveau Derde niveau Vierde niveau.nl ENTRADA Vijfde niveau CENTR-tech 33 November 2015 Marco Davids, SIDN Labs Wie zijn wij? Mijlpalen

More information

Two Projects, One Challenge: Common research data issues in MIREX and HTRC

Two Projects, One Challenge: Common research data issues in MIREX and HTRC Two Projects, One Challenge: Common research data issues in MIREX and HTRC Presented by J. Stephen Downie University of Illinois at Urbana Champaign Acknowledgements Most of today s slides are directly

More information

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current

More information

Chapter-15 -------------------------------------------- Replication in SQL Server

Chapter-15 -------------------------------------------- Replication in SQL Server Important Terminologies: What is Replication? Replication is the process where data is copied between databases on the same server or different servers connected by LANs, WANs, or the Internet. Microsoft

More information

McAfee VirusScan and epolicy Orchestrator Administration Course

McAfee VirusScan and epolicy Orchestrator Administration Course McAfee VirusScan and epolicy Orchestrator Administration Course Intel Security Education Services Administration Course Training The McAfee VirusScan and epolicy Orchestrator Administration course from

More information

Katta & Hadoop. Katta - Distributed Lucene Index in Production. Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com

Katta & Hadoop. Katta - Distributed Lucene Index in Production. Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com 1 Katta & Hadoop Katta - Distributed Lucene Index in Production Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com foto by: belgianchocolate@flickr.com 2 Intro Business intelligence reports from

More information

Apache HBase. Crazy dances on the elephant back

Apache HBase. Crazy dances on the elephant back Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage

More information

Business Intelligence Overview. BW/BI Security. BW/BI Architecture. Business Explorer (BEx) BW/BI BEx Tools Overview. What is BEx?

Business Intelligence Overview. BW/BI Security. BW/BI Architecture. Business Explorer (BEx) BW/BI BEx Tools Overview. What is BEx? SAP Business Warehouse/Business Intelligence Reporting Business Explorer (BEx) Washington State HRMS Business Warehouse/Business Intelligence (BW/BI) BW/BI Power User Workshop Materials General Topics

More information

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013 F1: A Distributed SQL Database That Scales Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013 What is F1? Distributed relational database Built to replace sharded MySQL back-end of AdWords

More information

Data analy(cs workflows for climate

Data analy(cs workflows for climate Data analy(cs workflows for climate Dr. Sandro Fiore Leader, Scientific data management research group Scientific Computing Division @ CMCC Prof. Giovanni Aloisio Director, Scientific Computing Division

More information

Structured Content: the Key to Agile. Web Experience Management. Introduction

Structured Content: the Key to Agile. Web Experience Management. Introduction Structured Content: the Key to Agile CONTENTS Introduction....................... 1 Structured Content Defined...2 Structured Content is Intelligent...2 Structured Content and Customer Experience...3 Structured

More information

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage

More information

Sitecore Best Practices -

Sitecore Best Practices - Sitecore Best Practices - War Stories From the Trenches Charles Turano Hedgehog Development @CharlesTurano Sitecore User Group Conference 2015 1 Hedgehog Sitecore User Group Conference 2015 2 Agenda What

More information

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH Real-time Data Analytics mit Elasticsearch Bernhard Pflugfelder inovex GmbH Bernhard Pflugfelder Big Data Engineer @ inovex Fields of interest: search analytics big data bi Working with: Lucene Solr Elasticsearch

More information

SAS BI Dashboard 4.3. User's Guide. SAS Documentation

SAS BI Dashboard 4.3. User's Guide. SAS Documentation SAS BI Dashboard 4.3 User's Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2010. SAS BI Dashboard 4.3: User s Guide. Cary, NC: SAS Institute

More information

The New ADS Search Interface and API

The New ADS Search Interface and API The New ADS Search Interface and API Alberto Accomazzi - @aaccomazzi for the ADS team - @adsabs 28 September 2013 IVOA Kona Saturday, September 28, 13 The ADS Classic System No frameworks available in

More information

A Scalable Data Transformation Framework using the Hadoop Ecosystem

A Scalable Data Transformation Framework using the Hadoop Ecosystem A Scalable Data Transformation Framework using the Hadoop Ecosystem Raj Nair Director Data Platform Kiru Pakkirisamy CTO AGENDA About Penton and Serendio Inc Data Processing at Penton PoC Use Case Functional

More information

Workflow Solutions for Very Large Workspaces

Workflow Solutions for Very Large Workspaces Workflow Solutions for Very Large Workspaces February 3, 2016 - Version 9 & 9.1 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

More information

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 HBase A Comprehensive Introduction James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 Overview Overview: History Began as project by Powerset to process massive

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

Oracle Database 11g: SQL Tuning Workshop

Oracle Database 11g: SQL Tuning Workshop Oracle University Contact Us: + 38516306373 Oracle Database 11g: SQL Tuning Workshop Duration: 3 Days What you will learn This Oracle Database 11g: SQL Tuning Workshop Release 2 training assists database

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

RS MDM. Integration Guide. Riversand

RS MDM. Integration Guide. Riversand RS MDM 2009 Integration Guide This document provides the details about RS MDMCenter integration module and provides details about the overall architecture and principles of integration with the system.

More information

SAS BI Course Content; Introduction to DWH / BI Concepts

SAS BI Course Content; Introduction to DWH / BI Concepts SAS BI Course Content; Introduction to DWH / BI Concepts SAS Web Report Studio 4.2 SAS EG 4.2 SAS Information Delivery Portal 4.2 SAS Data Integration Studio 4.2 SAS BI Dashboard 4.2 SAS Management Console

More information

Tivoli Enterprise Portal

Tivoli Enterprise Portal IBM Tivoli Monitoring Version 6.3 Tivoli Enterprise Portal User's Guide SC22-5447-00 IBM Tivoli Monitoring Version 6.3 Tivoli Enterprise Portal User's Guide SC22-5447-00 Note Before using this information

More information

NoSQL Roadshow Berlin Kai Spichale

NoSQL Roadshow Berlin Kai Spichale Full-text Search with NoSQL Technologies NoSQL Roadshow Berlin Kai Spichale 25.04.2013 About me Kai Spichale Software Engineer at adesso AG Author in professional journals, conference speaker adesso is

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

www.objectivity.com Choosing The Right Big Data Tools For The Job A Polyglot Approach

www.objectivity.com Choosing The Right Big Data Tools For The Job A Polyglot Approach www.objectivity.com Choosing The Right Big Data Tools For The Job A Polyglot Approach Nic Caine NoSQL Matters, April 2013 Overview The Problem Current Big Data Analytics Relationship Analytics Leveraging

More information

Why NoSQL? Your database options in the new non- relational world. 2015 IBM Cloudant 1

Why NoSQL? Your database options in the new non- relational world. 2015 IBM Cloudant 1 Why NoSQL? Your database options in the new non- relational world 2015 IBM Cloudant 1 Table of Contents New types of apps are generating new types of data... 3 A brief history on NoSQL... 3 NoSQL s roots

More information

Installation & User Guide

Installation & User Guide SharePoint List Filter Plus Web Part Installation & User Guide Copyright 2005-2011 KWizCom Corporation. All rights reserved. Company Headquarters KWizCom 50 McIntosh Drive, Unit 109 Markham, Ontario ON

More information

these three NoSQL databases because I wanted to see a the two different sides of the CAP

these three NoSQL databases because I wanted to see a the two different sides of the CAP Michael Sharp Big Data CS401r Lab 3 For this paper I decided to do research on MongoDB, Cassandra, and Dynamo. I chose these three NoSQL databases because I wanted to see a the two different sides of the

More information

SIMPLY REPORTS DEVELOPED BY THE SHARE STAFF SERVICES TEAM

SIMPLY REPORTS DEVELOPED BY THE SHARE STAFF SERVICES TEAM SIMPLY REPORTS DEVELOPED BY THE SHARE STAFF SERVICES TEAM Winter 2014 TABLE OF CONTENTS Logging In and Overview...3 Steps to Creating a Report...7 Steps for Publishing a Report...9 Using Data from a Record

More information

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011 SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications Jürgen Primsch, SAP AG July 2011 Why In-Memory? Information at the Speed of Thought Imagine access to business data,

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

DNS Big Data Analy@cs

DNS Big Data Analy@cs Klik om de s+jl te bewerken Klik om de models+jlen te bewerken! Tweede niveau! Derde niveau! Vierde niveau DNS Big Data Analy@cs Vijfde niveau DNS- OARC Fall 2015 Workshop October 4th 2015 Maarten Wullink,

More information

Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data

Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data David Minor 1, Reagan Moore 2, Bing Zhu, Charles Cowart 4 1. (88)4-104 minor@sdsc.edu San Diego Supercomputer Center

More information

Scalable Forensics with TSK and Hadoop. Jon Stewart

Scalable Forensics with TSK and Hadoop. Jon Stewart Scalable Forensics with TSK and Hadoop Jon Stewart CPU Clock Speed Hard Drive Capacity The Problem CPU clock speed stopped doubling Hard drive capacity kept doubling Multicore CPUs to the rescue!...but

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Lost in Space? Methodology for a Guided Drill-Through Analysis Out of the Wormhole

Lost in Space? Methodology for a Guided Drill-Through Analysis Out of the Wormhole Paper BB-01 Lost in Space? Methodology for a Guided Drill-Through Analysis Out of the Wormhole ABSTRACT Stephen Overton, Overton Technologies, LLC, Raleigh, NC Business information can be consumed many

More information

Using EMC Documentum with Adobe LiveCycle ES

Using EMC Documentum with Adobe LiveCycle ES Technical Guide Using EMC Documentum with Adobe LiveCycle ES Table of contents 1 Deployment 3 Managing LiveCycle ES development assets in Documentum 5 Developing LiveCycle applications with contents in

More information

INTRODUCTION TO CASSANDRA

INTRODUCTION TO CASSANDRA INTRODUCTION TO CASSANDRA This ebook provides a high level overview of Cassandra and describes some of its key strengths and applications. WHAT IS CASSANDRA? Apache Cassandra is a high performance, open

More information

Scaling up = getting a better machine. Scaling out = use another server and add it to your cluster.

Scaling up = getting a better machine. Scaling out = use another server and add it to your cluster. MongoDB 1. Introduction MongoDB is a document-oriented database, not a relation one. It replaces the concept of a row with a document. This makes it possible to represent complex hierarchical relationships

More information

P R O V I S I O N I N G O R A C L E H Y P E R I O N F I N A N C I A L M A N A G E M E N T

P R O V I S I O N I N G O R A C L E H Y P E R I O N F I N A N C I A L M A N A G E M E N T O R A C L E H Y P E R I O N F I N A N C I A L M A N A G E M E N T, F U S I O N E D I T I O N R E L E A S E 1 1. 1. 1.x P R O V I S I O N I N G O R A C L E H Y P E R I O N F I N A N C I A L M A N A G E

More information

Large Scale Text Analysis Using the Map/Reduce

Large Scale Text Analysis Using the Map/Reduce Large Scale Text Analysis Using the Map/Reduce Hierarchy David Buttler This work is performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

SharePoint 2010 Performance and Capacity Planning Best Practices

SharePoint 2010 Performance and Capacity Planning Best Practices Information Technology Solutions SharePoint 2010 Performance and Capacity Planning Best Practices Eric Shupps SharePoint Server MVP About Information Me Technology Solutions SharePoint Server MVP President,

More information

SQL Anywhere 12 New Features Summary

SQL Anywhere 12 New Features Summary SQL Anywhere 12 WHITE PAPER www.sybase.com/sqlanywhere Contents: Introduction... 2 Out of Box Performance... 3 Automatic Tuning of Server Threads... 3 Column Statistics Management... 3 Improved Remote

More information

Real World Enterprise SQL Server Replication Implementations. Presented by Kun Lee sa@ilovesql.com

Real World Enterprise SQL Server Replication Implementations. Presented by Kun Lee sa@ilovesql.com Real World Enterprise SQL Server Replication Implementations Presented by Kun Lee sa@ilovesql.com About Me DBA Manager @ CoStar Group, Inc. MSSQLTip.com Author (http://www.mssqltips.com/sqlserverauthor/15/kunlee/)

More information

Openbus Documentation

Openbus Documentation Openbus Documentation Release 1 Produban February 17, 2014 Contents i ii An open source architecture able to process the massive amount of events that occur in a banking IT Infraestructure. Contents:

More information

HADOOP MOCK TEST HADOOP MOCK TEST I

HADOOP MOCK TEST HADOOP MOCK TEST I http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

About this release. McAfee Application Control and Change Control 6.1.1. Addendum. Content change tracking. Configure content change tracking rule

About this release. McAfee Application Control and Change Control 6.1.1. Addendum. Content change tracking. Configure content change tracking rule Addendum McAfee Application Control and Change Control 6.1.1 About this release For use with epolicy Orchestrator 4.6 5.0 Software This document is an addendum to the McAfee Change Control and Application

More information

Time series IoT data ingestion into Cassandra using Kaa

Time series IoT data ingestion into Cassandra using Kaa Time series IoT data ingestion into Cassandra using Kaa Andrew Shvayka ashvayka@cybervisiontech.com Agenda Data ingestion challenges Why Kaa? Why Cassandra? Reference architecture overview Hands-on Sandbox

More information

Ligero Content Delivery Server. Documentum Content Integration with

Ligero Content Delivery Server. Documentum Content Integration with Ligero Content Delivery Server Documentum Content Integration with Ligero Content Delivery Server Prepared By Lee Dallas Principal Consultant Armedia, LLC April, 2008 1 Summary Ligero Content Delivery

More information

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12 Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language

More information

www.basho.com Technical Overview Simple, Scalable, Object Storage Software

www.basho.com Technical Overview Simple, Scalable, Object Storage Software www.basho.com Technical Overview Simple, Scalable, Object Storage Software Table of Contents Table of Contents... 1 Introduction & Overview... 1 Architecture... 2 How it Works... 2 APIs and Interfaces...

More information

Reporting Services. White Paper. Published: August 2007 Updated: July 2008

Reporting Services. White Paper. Published: August 2007 Updated: July 2008 Reporting Services White Paper Published: August 2007 Updated: July 2008 Summary: Microsoft SQL Server 2008 Reporting Services provides a complete server-based platform that is designed to support a wide

More information

McAfee Application Control / Change Control Administration Intel Security Education Services Administration Course

McAfee Application Control / Change Control Administration Intel Security Education Services Administration Course McAfee Application Control / Change Control Administration Intel Security Education Services Administration Course The McAfee University Application Control / Change Control Administration course enables

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Hitachi Storage Command Suite Logical Groups and LDEV Labels

Hitachi Storage Command Suite Logical Groups and LDEV Labels Hitachi Storage Command Suite Logical Groups and LDEV Labels September 10, 2008 John Harker Senior Product Marketing Manager Malcolm Muir Senior Competitive Analyst Hitachi Data Systems WebTech Series

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

CA DLP. Stored Data Integration Guide. Release 14.0. 3rd Edition

CA DLP. Stored Data Integration Guide. Release 14.0. 3rd Edition CA DLP Stored Data Integration Guide Release 14.0 3rd Edition This Documentation, which includes embedded help systems and electronically distributed materials, (hereinafter referred to as the Documentation

More information

Ganzheitliches Datenmanagement

Ganzheitliches Datenmanagement Ganzheitliches Datenmanagement für Hadoop Michael Kohs, Senior Sales Consultant @mikchaos The Problem with Big Data Projects in 2016 Relational, Mainframe Documents and Emails Data Modeler Data Scientist

More information

PROCESSING AND VISUALIZING DATA IN A CONTINUOUS COLLECTION ENVIRONMENT BYRON ELLIS SPONGECELL, INC

PROCESSING AND VISUALIZING DATA IN A CONTINUOUS COLLECTION ENVIRONMENT BYRON ELLIS SPONGECELL, INC PROCESSING AND VISUALIZING DATA IN A CONTINUOUS COLLECTION ENVIRONMENT BYRON ELLIS SPONGECELL, INC AN OBSERVATION You know a lot about your compute environment! WE DON'T... Only vague information about

More information

K@ A collaborative platform for knowledge management

K@ A collaborative platform for knowledge management White Paper K@ A collaborative platform for knowledge management Quinary SpA www.quinary.com via Pietrasanta 14 20141 Milano Italia t +39 02 3090 1500 f +39 02 3090 1501 Copyright 2004 Quinary SpA Index

More information

BUSINESSOBJECTS DATA INTEGRATOR

BUSINESSOBJECTS DATA INTEGRATOR PRODUCTS BUSINESSOBJECTS DATA INTEGRATOR IT Benefits Correlate and integrate data from any source Efficiently design a bulletproof data integration process Accelerate time to market Move data in real time

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

DESYcloud: an owncloud & dcache update

DESYcloud: an owncloud & dcache update : an owncloud & dcache update Paul Millar (on behalf of team) : an owncloud & dcache update Cloud Services for Synchronisation and Sharing Zürich, Switzerland. 2016-01-18 2016-01-19 http://cs3.ethz.ch/

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

CitusDB Architecture for Real-Time Big Data

CitusDB Architecture for Real-Time Big Data CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing

More information

Cloud Scale Distributed Data Storage. Jürmo Mehine

Cloud Scale Distributed Data Storage. Jürmo Mehine Cloud Scale Distributed Data Storage Jürmo Mehine 2014 Outline Background Relational model Database scaling Keys, values and aggregates The NoSQL landscape Non-relational data models Key-value Document-oriented

More information

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning

More information

A Tutorial Introduc/on to Big Data. Hands On Data Analy/cs over EMR. Robert Grossman University of Chicago Open Data Group

A Tutorial Introduc/on to Big Data. Hands On Data Analy/cs over EMR. Robert Grossman University of Chicago Open Data Group A Tutorial Introduc/on to Big Data Hands On Data Analy/cs over EMR Robert Grossman University of Chicago Open Data Group Collin BenneE Open Data Group November 12, 2012 1 Amazon AWS Elas/c MapReduce allows

More information

DISK IMAGE BACKUP. For Physical Servers. VEMBU TECHNOLOGIES www.vembu.com TRUSTED BY OVER 25,000 BUSINESSES

DISK IMAGE BACKUP. For Physical Servers. VEMBU TECHNOLOGIES www.vembu.com TRUSTED BY OVER 25,000 BUSINESSES DISK IMAGE BACKUP For Physical Servers VEMBU TECHNOLOGIES www.vembu.com Copyright Information Information in this document is subject to change without notice. The entire risk of the use or the results

More information

Software Development and Deployment

Software Development and Deployment Software Development and Deployment PDS Management Council Face-to-Face Berkeley, California November 18-19, 2014 Sean Hardman Topics Overview Build 5a Deployment Status Repor:ng Build 5b Next Steps November

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc. Oracle BI EE Implementation on Netezza Prepared by SureShot Strategies, Inc. The goal of this paper is to give an insight to Netezza architecture and implementation experience to strategize Oracle BI EE

More information

Geodatabase Programming with SQL

Geodatabase Programming with SQL DevSummit DC February 11, 2015 Washington, DC Geodatabase Programming with SQL Craig Gillgrass Assumptions Basic knowledge of SQL and relational databases Basic knowledge of the Geodatabase We ll hold

More information

Practical Cassandra. Vitalii Tymchyshyn tivv00@gmail.com @tivv00

Practical Cassandra. Vitalii Tymchyshyn tivv00@gmail.com @tivv00 Practical Cassandra NoSQL key-value vs RDBMS why and when Cassandra architecture Cassandra data model Life without joins or HDD space is cheap today Hardware requirements & deployment hints Vitalii Tymchyshyn

More information

Information Retrieval Elasticsearch

Information Retrieval Elasticsearch Information Retrieval Elasticsearch IR Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches

More information

Oracle Warehouse Builder 10g

Oracle Warehouse Builder 10g Oracle Warehouse Builder 10g Architectural White paper February 2004 Table of contents INTRODUCTION... 3 OVERVIEW... 4 THE DESIGN COMPONENT... 4 THE RUNTIME COMPONENT... 5 THE DESIGN ARCHITECTURE... 6

More information

AlienVault Unified Security Management (USM) 4.x-5.x. Deployment Planning Guide

AlienVault Unified Security Management (USM) 4.x-5.x. Deployment Planning Guide AlienVault Unified Security Management (USM) 4.x-5.x Deployment Planning Guide USM 4.x-5.x Deployment Planning Guide, rev. 1 Copyright AlienVault, Inc. All rights reserved. The AlienVault Logo, AlienVault,

More information

Basics Of Replication: SQL Server 2000

Basics Of Replication: SQL Server 2000 Basics Of Replication: SQL Server 2000 Table of Contents: Replication: SQL Server 2000 - Part 1 Replication Benefits SQL Server Platform for Replication Entities for the SQL Server Replication Model Entities

More information

Texas Digital Government Summit. Data Analysis Structured vs. Unstructured Data. Presented By: Dave Larson

Texas Digital Government Summit. Data Analysis Structured vs. Unstructured Data. Presented By: Dave Larson Texas Digital Government Summit Data Analysis Structured vs. Unstructured Data Presented By: Dave Larson Speaker Bio Dave Larson Solu6ons Architect with Freeit Data Solu6ons In the IT industry for over

More information

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Agenda The rise of Big Data & Hadoop MySQL in the Big Data Lifecycle MySQL Solutions for Big Data Q&A

More information

A programming model in Cloud: MapReduce

A programming model in Cloud: MapReduce A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value

More information

Pipeline Orchestration for Test Automation using Extended Buildbot Architecture

Pipeline Orchestration for Test Automation using Extended Buildbot Architecture Pipeline Orchestration for Test Automation using Extended Buildbot Architecture Sushant G.Gaikwad Department of Computer Science and engineering, Walchand College of Engineering, Sangli, India. M.A.Shah

More information

Driving MySQL to Big Data Scale. Thomas Hazel Founder, Chief Scien@st thomas@deepis.com

Driving MySQL to Big Data Scale. Thomas Hazel Founder, Chief Scien@st thomas@deepis.com Driving MySQL to Big Data Scale Thomas Hazel Founder, Chief Scien@st thomas@deepis.com Millions to Billions to Trillions Agenda Driving MySQL to Big Data Scale Market Trends Hardware Trends Current Computer

More information