Data Mining Systems Development. Arno Siebes
1 Data Mining Systems Development Arno Siebes
2 A Variety of Systems
Data mining systems development depends on why you develop the system in the first place:
- Mining research systems, e.g., for:
  - algorithmic research
  - systems research
- Applied research, e.g., tools for bioinformatics
- Commercial use, e.g.:
  - horizontal systems, i.e., systems that support a wide variety of algorithms and (data mining) problems
  - vertical systems, i.e., systems that are completely targeted towards a specific (business!) problem
These applications share some requirements on the system and differ on others. Even where they share requirements, the degrees and the priorities can be rather different. Let us list and contrast these requirements.
Arno Siebes Data Mining Systems Development (page 2 of 27)
3 Algorithms
The need for algorithms varies with the application:
- Many and easily extended:
  - pure research: extensibility is crucial for algorithmic research
  - horizontal commercial systems: under the precondition of manageability
- Specially tailored:
  - vertical commercial systems: often even so far as having parameters fixed in an automation phase
  - applied research: clearly, but extensibility is also important
Extensibility can be achieved in various ways:
- a monolithic system: this means lots of code redundancy, and code management becomes cumbersome
- a plug-and-play approach using code from various sources and APIs (the Kepler approach)
- a class library approach: only suitable for well-trained programmers, but it allows for lots of code sharing (the MLC++ approach)
- a component-based approach: decomposing algorithms into representation language, search operators, search strategies and quality functions (the Keso approach)
4 Data Management: storage
The data has to reside somewhere. It could be:
- a file system:
  - algorithmic research
- a dbms within the system (note that the intended difference between a file system and a dbms is the presence of search accelerators (indices), special operators and the support of complex data structures):
  - systems research: you want access to all aspects that govern scalability
  - applied research: the data doesn't easily fit into standard (relational) dbmss
- a commercial dbms:
  - commercial systems: that is where the data resides!
Note that data import facilities into an internal system may speed up processing considerably, but importing may not be what your client actually wants.
5 KDD Process Support
We all know that data analysis is more than just running algorithms on a data set; there is lots of pre- and post-processing:
- Data Preparation
  - Data Selection
  - Data Cleaning
  - Data Transformations
- Experimental Set Up
  - Random subset selection
  - Cross validation
- Post Processing
  - Result inspection (e.g., visualization)
  - Result exportation
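The experimental set-up steps listed above (random subset selection and cross validation) can be sketched in a few lines. This is only an illustration: the function name `k_fold_indices` and the fixed seed are assumptions made for the example and do not come from the talk.

```python
import random
from typing import List, Tuple


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split row indices 0..n-1 into k (train, test) pairs after a random shuffle."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # random subset selection
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    splits = []
    for i, test in enumerate(folds):
        # Training set: every fold except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    for train, test in k_fold_indices(10, 3):
        print(sorted(train), sorted(test))
```

Each row ends up in exactly one test fold, so every row is used for validation once.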
6 Data Preparation
There are various ways in which one can prepare data:
- Unix Tools and Scripting Languages
  - algorithmic research: all the flexibility you need
  - systems research: well, that is if there are no interesting research questions involved
- SQL
  - applied research: if the users are well trained in such languages
  - horizontal systems: if the users are well trained in such languages
- Visual Tools in the Interface
  - commercial systems: the other possibilities are simply too difficult for most users
  - applied research and systems research: there are interesting research questions!
7 Experimental Set Up
Who needs an experimental set up? You may think everyone, but that isn't true.
Who doesn't? Why?
- in vertical applications there may be a hidden experimental set up, but the user should not be bothered by it
- the users are most probably simply not smart enough to know what they are doing
- the users won't have the resources to do this
Does this mean you don't test? (Shudder!)
- not necessarily: the system may have been tuned beforehand, while the results are monitored by experts
8 Post Processing
Post processing needs vary:
- None:
  - algorithmic research: the results are only used in the validation of the algorithm
- Result Inspection:
  - systems research: interesting research questions, e.g., how to visualize sets of association rules
  - applied research and commercial systems: decisions will be made on the basis of these results, thus thorough understanding is necessary
- Result Exportation:
  - horizontal systems: at least some XML export facility
  - vertical systems: seamless integration with other products
9 Developing Systems: my experiences and plans
In the remainder of this talk, I will discuss some of my experiences and plans:
- The Keso system:
  - system development research
  - some algorithmic research
  - goal: a (modest) horizontal system
- Data Surveyor at Data Distilleries:
  - parallel with the Keso development
  - original goal: a horizontal system
  - products: vertical systems for analytical CRM
- The future:
  - applied research: bioinformatics
  - system development research to support this work
10 The Keso System
The Keso system is a client-server system.
[Diagram: client and server, with the components user interface, mining engine, mining server (dbms) and caching facilities]
We'll briefly discuss the mining engine and the mining server aspects of Keso.
11 The Mining Engine
One of the aspects of the KESO data mining kernel architecture is that each data mining algorithm consists of four components:
1. A model representation language.
2. A (local) quality function: which of these models fit the database best?
3. A search strategy: exhaustive or heuristics-driven search for good/best models.
4. Search operators: these define how a search strategy goes through the space of all possible models.
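To make the four-component decomposition concrete, here is a minimal sketch in Python. It is not KESO code: the representation (conjunctions of attribute = value tests), the quality function (fraction of covered rows with a true target), the operator `refine` and the strategy `hill_climb` are all assumptions chosen to keep the example small. The point is only that the strategy receives the operator and the quality function as interchangeable parts.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, object]
# 1. Representation language: a model is a conjunction of (attribute, value) tests.
Model = Tuple[Tuple[str, object], ...]


def covers(model: Model, row: Row) -> bool:
    """True if the row satisfies every test in the model."""
    return all(row[attr] == value for attr, value in model)


# 2. Quality function: fraction of covered rows whose target is true.
def quality(model: Model, rows: List[Row], target: str) -> float:
    covered = [r for r in rows if covers(model, r)]
    if not covered:
        return 0.0
    return sum(1 for r in covered if r[target]) / len(covered)


# 4. Search operator: extend a model with one extra attribute = value test.
def refine(model: Model, rows: List[Row], target: str) -> Iterable[Model]:
    used = {attr for attr, _ in model}
    seen = set()
    for row in rows:
        for attr, value in row.items():
            if attr != target and attr not in used and (attr, value) not in seen:
                seen.add((attr, value))
                yield model + ((attr, value),)


# 3. Search strategy: greedy hill climbing over whatever the operator generates.
def hill_climb(rows: List[Row], target: str,
               operator: Callable[[Model, List[Row], str], Iterable[Model]],
               quality_fn: Callable[[Model, List[Row], str], float],
               max_depth: int = 3) -> Model:
    best: Model = ()
    best_q = quality_fn(best, rows, target)
    for _ in range(max_depth):
        candidates = list(operator(best, rows, target))
        if not candidates:
            break
        cand = max(candidates, key=lambda m: quality_fn(m, rows, target))
        cand_q = quality_fn(cand, rows, target)
        if cand_q <= best_q:  # stop when no refinement improves the quality
            break
        best, best_q = cand, cand_q
    return best


# Hypothetical toy data, used only for illustration.
ROWS: List[Row] = [
    {"age": "young", "city": "A", "buys": True},
    {"age": "young", "city": "B", "buys": True},
    {"age": "old", "city": "A", "buys": False},
    {"age": "old", "city": "B", "buys": False},
]

if __name__ == "__main__":
    print(hill_climb(ROWS, "buys", refine, quality))
```

Swapping `hill_climb` for an exhaustive or beam search, or `quality` for another measure, leaves the other components untouched, which is the kind of sharing the components approach aims for.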
12 The Implementation
In the architecture this is reflected as follows:
[Architecture diagram with the boxes: Search Man., Search Space Maintainer, Qual. C., Model Gen, Data Base]
13 Components
- Search Manager: contains a number of search modules. Each implements a search strategy.
- Description Generator: contains a number of operator modules. Each module implements an operator on a specific description language.
- Quality Computer: implements a number of quality function modules and has a separate module to query the database for cross-tables.
14 Search Space Manager
All communication between the components is through records in the Search Space Manager. Some fields in these records are:
- An oid (logical name of ϕ)
- The description (ϕ itself)
- The quality of ϕ
- The operator o with which ϕ is constructed
- The parent model, i.e., ψ, with o(ψ) = ϕ
While the search space is explored, these records are stored in a database.
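The record layout described above might look as follows. The field names follow the slide; the class names `SearchRecord` and `SearchSpace`, the in-memory dictionary (standing in for the database) and the `route` helper are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SearchRecord:
    oid: int                  # logical name of the model
    description: str          # the model itself
    quality: float            # quality of the model
    operator: Optional[str]   # operator that constructed it (None for the root)
    parent: Optional[int]     # oid of the parent model (None for the root)


class SearchSpace:
    """In-memory stand-in for the database of explored models."""

    def __init__(self) -> None:
        self._records: Dict[int, SearchRecord] = {}

    def add(self, description: str, quality: float,
            operator: Optional[str] = None, parent: Optional[int] = None) -> int:
        oid = len(self._records)
        self._records[oid] = SearchRecord(oid, description, quality, operator, parent)
        return oid

    def route(self, oid: int) -> List[str]:
        """Follow parent links back to the root and return the path root-first."""
        path: List[str] = []
        current: Optional[int] = oid
        while current is not None:
            record = self._records[current]
            path.append(record.description)
            current = record.parent
        return list(reversed(path))


if __name__ == "__main__":
    space = SearchSpace()
    root = space.add("true", 0.5)
    child = space.add("age = young", 0.8, operator="add_condition", parent=root)
    print(space.route(child))
```

Because every record carries its parent and operator, the route a heuristic search took can be reconstructed from any stored point.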
15 Lessons Learned: The Search Space Manager
Why did we decide to store the explored part of a search space?
- it allows users to see the route that the (heuristic) search has taken and to explore other avenues from arbitrary points in that space
- it can be used as a caching mechanism: for points that you have already visited, you don't have to repeat the quality computation
Caching: it is a nice idea, but not that practical.
- The search space grows quickly. Maintaining a good index structure is expensive, and linear search becomes expensive quickly as well.
- Conclusion: it is better to speed up the quality computation (i.e., database access).
Exploring the Search Space: this is a good idea, but it requires good user interface support.
- Experience teaches that it is not necessary (not to say overkill) to store all points. It is far better to store only those points that were considered interesting during the search and to do some recomputations if necessary.
16 Lessons Learned: The Components Approach
Why did we choose a component-based approach?
- extensibility through component sharing
- localised database access (quality computation)
Extensibility:
- search strategies: clearly true
- operators and quality functions: these depend strongly on the modelling language. Flexibility within one language is good.
Localised database access: good results, especially because of uniform database access (through data cubes). This allows for all kinds of optimization.
17 The Mining Server
At CWI we have a reasonably powerful SGI Origin 2000 SMP machine: /300 MHz MIPS and 64GB main memory. It is nice to exploit such hardware for the mining server. The possibilities lie in a few aspects of the Keso system:
- KESO's use of the main-memory DBMS Monet (a CWI/UvA research prototype)
- the fact that the search of many data mining algorithms has a zooming type of behaviour
- the fact that quality functions are based on relatively simple aggregates, such as count and sum
- the fact that these aggregate queries are submitted in batches of related queries
18 Monet
Monet is a main-memory DBMS, i.e., it is assumed that the database hotspot will always fit in main memory. An important aspect for us is that the data is stored as columns (per attribute) rather than as rows (per tuple), as is customary. Each attribute is stored in a BAT (binary association table) as oid-value pairs.
Data mining algorithms tend to investigate mostly one and at most a few attributes at the same time. In the case of classification trees, one considers the qualities of the splits per attribute:
- this ensures that the hotspot fits
- it allows parallelization:
  - distribute the attributes over the processors
  - distribute each attribute over the processors
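The column-wise layout can be illustrated with a small sketch. This is not Monet code: a BAT is modelled here simply as a Python list of (oid, value) pairs, and the helper names `decompose` and `value_counts` are assumptions for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

# A binary association table, modelled as a list of (oid, value) pairs.
BAT = List[Tuple[int, object]]


def decompose(rows: List[dict]) -> Dict[str, BAT]:
    """Turn row-wise tuples into one BAT per attribute."""
    columns: Dict[str, BAT] = {}
    for oid, row in enumerate(rows):
        for attr, value in row.items():
            columns.setdefault(attr, []).append((oid, value))
    return columns


def value_counts(bat: BAT) -> Counter:
    """Count values of a single attribute; only that one column is touched."""
    return Counter(value for _, value in bat)


if __name__ == "__main__":
    columns = decompose([
        {"age": "young", "buys": True},
        {"age": "old", "buys": False},
        {"age": "young", "buys": False},
    ])
    print(columns["age"])
    print(value_counts(columns["age"]))
```

Evaluating a split on one attribute reads a single column, which is why the working set stays small and why attributes can be handed to different processors.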
19 Zooming
Consider again the construction of a classification tree. Each node n in the tree describes a subset s_n of the database:
- if n is a child of m, s_n is a subset of s_m
In other words, if we store the select set of the parent, we only have to scan that part of the database for the evaluation of the candidate children:
- this requires the extra storage of one column only
- the optimization in time is exponential in the depth of the tree
Similar observations hold for many other mining algorithms.
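A small sketch of the zooming idea, assuming columns are stored as oid-to-value dictionaries and a select set is a set of oids. The function `select` and the toy columns are hypothetical; the scan counter is only there to show how much of the column is read.

```python
from typing import Dict, Hashable, Iterable, Optional, Set, Tuple

Column = Dict[int, Hashable]  # oid -> value


def select(column: Column, value: Hashable,
           within: Optional[Iterable[int]] = None) -> Tuple[Set[int], int]:
    """Return the oids whose value matches, plus the number of entries scanned.

    If `within` (the parent's select set) is given, only those oids are scanned.
    """
    candidates = column.keys() if within is None else within
    hits: Set[int] = set()
    scanned = 0
    for oid in candidates:
        scanned += 1
        if column[oid] == value:
            hits.add(oid)
    return hits, scanned


# Hypothetical toy columns.
AGE: Column = {0: "young", 1: "old", 2: "young", 3: "old", 4: "young", 5: "young"}
CITY: Column = {0: "A", 1: "A", 2: "B", 3: "B", 4: "A", 5: "B"}

if __name__ == "__main__":
    parent, scanned_parent = select(AGE, "young")               # full scan
    child, scanned_child = select(CITY, "A", within=parent)     # scan parent only
    print(parent, scanned_parent)
    print(child, scanned_child)
```

The child node scans four entries instead of six here; deeper in the tree the stored select sets shrink further.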
20 The Aggregates
In a client-server architecture (such as KESO), it is beneficial if the aggregates on which the quality functions are based are computed in the server rather than in the client.
From a parallelization point of view, these aggregates have the nice property that they distribute:
- the count of the whole select set is the sum of the counts of a partition of the select set
- the sum is the sum of the sums
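The distributive property can be checked with a short sketch: each partition computes its own (count, sum) pair and the pairs are combined afterwards. The function names and the round-robin partitioning are assumptions made for the example; in a real system each partition would be handled by its own processor.

```python
from typing import List, Sequence, Tuple


def partition(values: Sequence[int], parts: int) -> List[Sequence[int]]:
    """Split the select set into `parts` disjoint chunks (round-robin)."""
    return [values[i::parts] for i in range(parts)]


def partial_aggregate(chunk: Sequence[int]) -> Tuple[int, int]:
    """Work done per partition: a local count and a local sum."""
    return len(chunk), sum(chunk)


def combine(partials: Sequence[Tuple[int, int]]) -> Tuple[int, int]:
    """The global count is the sum of counts; the global sum is the sum of sums."""
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_count, total_sum


if __name__ == "__main__":
    values = [3, 1, 4, 1, 5, 9, 2, 6]
    partials = [partial_aggregate(chunk) for chunk in partition(values, 3)]
    print(combine(partials))
```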
21 Multi-query optimization
Traditionally, a DBMS optimizes each query posed to it separately. It does not take other queries in the system into account.
While constructing a classification tree, we consider all possible extensions at the same time. This leads to a batch of strongly related aggregate queries to the database.
Optimizing the combined set of queries (which is possible because of, among other things, the distributive nature of the aggregates) leads to far better response times than running the set of individually optimized queries. To do this, a special optimizer (called the MIL squeezer) has been implemented in Monet.
Note that normal query optimization is still very much part of the process, of course.
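The benefit of treating related aggregate queries as a batch can be illustrated with a toy example. This is not how the MIL squeezer works internally (the talk does not describe that); it only contrasts one scan per count query with a single shared scan that answers every (attribute, value, class) count at once. All names and data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Dict[str, object]


def answer_single(rows: List[Row], attr: str, value: object,
                  target: str, cls: object) -> int:
    """One count query, answered with its own scan over the rows."""
    return sum(1 for r in rows if r[attr] == value and r[target] == cls)


def answer_batch(rows: List[Row], attributes: List[str],
                 target: str) -> Dict[Tuple[str, object, object], int]:
    """All count queries for all candidate splits, answered in one shared scan."""
    counts: Counter = Counter()
    for row in rows:
        for attr in attributes:
            counts[(attr, row[attr], row[target])] += 1
    return dict(counts)


# Hypothetical toy data.
ROWS: List[Row] = [
    {"age": "young", "city": "A", "buys": True},
    {"age": "young", "city": "B", "buys": True},
    {"age": "old", "city": "A", "buys": False},
    {"age": "old", "city": "B", "buys": False},
]

if __name__ == "__main__":
    batch = answer_batch(ROWS, ["age", "city"], "buys")
    for key, count in sorted(batch.items(), key=str):
        print(key, count)
```

Both routes give the same counts; the batch version reads the data once instead of once per query.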
22 Lessons learned
Using our own DBMS was a major asset in the work:
- all unnecessary components, such as locking and transaction management in general, could be left out
- operators such as the datacube could be implemented directly in the core of the DBMS
- all optimization tricks could be built in directly
- all of this results in far greater speed than commercial platforms such as Oracle can deliver
Moreover, it means that new data structures etc. are easily supported. In particular, joint work with Arno Knobbe (Syllogic) shows that the whole framework can easily be generalized to multi-relational mining.
If you want a drawback: stability.
23 Experiences at DD
As noted before, the development at Data Distilleries went in parallel with the development of the KESO system (in which they were part of the team, together with UH and GMD).
Their technical experiences are in part what I've told you. But there are some interesting business experiences:
- User Interface: the importance of this aspect cannot be overestimated. The interface should support the whole process in an intuitive fashion. E.g., SQL is far too difficult for many users.
- Server:
  - exploiting an internal DBMS is ok, but you should be able to import data from a wide variety of sources
  - speed is not that important; other aspects dominate
- Extensibility: most clients are interested in solving business problems, i.e., they require vertical systems. It is not the range of algorithms but the suitability of the algorithm for that purpose that counts. This also means that result export has to integrate with other products.
24 The Future
I have learned a lot from the development of the Keso system, and I will start developing a new one at Utrecht University:
- again we will use Monet
- again we will think component-based
But there is also going to be a difference: it is going to be an applied research vehicle for bioinformatics (in particular for genomics and proteomics).
That is, it is going to be a biologist's workbench:
- including a dbms for biological data
- supporting mining of biological data
25 Biomolecular Data
Biomolecular data is far from straightforward. Some main reasons are:
- During inheritance, DNA sequences are changed (both through cross-over and mutation).
- Only parts of DNA strings encode genes, which may be turned on or off or even be silenced. The rest is called junk DNA...
- Large parts of the transcribed RNA are introns that are subsequently removed.
- Only part of the mRNA string is translated into protein.
- Because of the redundancy of the genetic code, different mRNA substrings may yield the same protein.
- Subtly different protein molecules may act completely interchangeably, because large parts of the protein simply guarantee the required geometry of the more active parts of the molecule.
26 Searching
Because of this approximate nature of the strings, searching databases is not:
- the standard database equality search
- not even exact string matching
It is an alignment problem with penalty functions for the errors. There are a number of algorithms for this problem:
- using dynamic programming: find the best (partial) alignment between two strings
- using Hidden Markov Models: how likely is it that one string is produced by a machine that could have produced another string?
Moreover, such algorithms have been transferred to a database setting in programs such as BLAST and FASTA.
However, these tricks are far from the regular database approach, i.e., indices. One of the goals will be to build index structures in Monet targeted towards alignment searches.
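As an illustration of the dynamic programming approach, the sketch below computes the score of the best local (partial) alignment between two strings in the style of the Smith-Waterman recurrence. The scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions, and the function returns only the score, not the alignment itself. It is not the method used by BLAST or FASTA, which rely on heuristics to avoid filling the full table.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Score of the best local alignment between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Align a[i-1] with b[j-1]: reward a match, penalise a mismatch.
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either extend with a gap in one of the strings, or restart at 0.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("ACGT", "ACGT"))
    print(local_alignment_score("ACGT", "ACT"))
    print(local_alignment_score("AAAA", "TTTT"))
```

The table has one cell per pair of positions, so the cost grows with the product of the two string lengths, which is why a plain scan over a large sequence database is expensive and index support is attractive.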
27 Mining
Biomolecular data is a paradise for data miners. There is not enough theory to attack the (enormous amounts of) data in a traditional way. Examples of questions are:
- discovering genes (there are approaches using HMMs, but more traditional techniques such as classification trees are also used)
- building phylogenetic trees (i.e., tracing back evolution)
- function prediction for proteins
- uncovering metabolic pathways
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia
More informationAlgorithms in Computational Biology (236522) spring 2007 Lecture #1
Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: Tuesday 11:00-12:00/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office
More informationTopics in basic DBMS course
Topics in basic DBMS course Database design Transaction processing Relational query languages (SQL), calculus, and algebra DBMS APIs Database tuning (physical database design) Basic query processing (ch
More informationBioinformatics Grid - Enabled Tools For Biologists.
Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis
More informationCloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
More informationBioinformatics Resources at a Glance
Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences
More informationSQL Server 2005 Features Comparison
Page 1 of 10 Quick Links Home Worldwide Search Microsoft.com for: Go : Home Product Information How to Buy Editions Learning Downloads Support Partners Technologies Solutions Community Previous Versions
More informationREGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])
305 REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) (See also General Regulations) Any publication based on work approved for a higher degree should contain a reference
More information1 File Processing Systems
COMP 378 Database Systems Notes for Chapter 1 of Database System Concepts Introduction A database management system (DBMS) is a collection of data and an integrated set of programs that access that data.
More informationA Non-Linear Schema Theorem for Genetic Algorithms
A Non-Linear Schema Theorem for Genetic Algorithms William A Greene Computer Science Department University of New Orleans New Orleans, LA 70148 bill@csunoedu 504-280-6755 Abstract We generalize Holland
More informationFinal Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
More informationIntroduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
More informationHidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006
Hidden Markov Models in Bioinformatics By Máthé Zoltán Kőrösi Zoltán 2006 Outline Markov Chain HMM (Hidden Markov Model) Hidden Markov Models in Bioinformatics Gene Finding Gene Finding Model Viterbi algorithm
More informationObjectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation
Objectives Distributed Databases and Client/Server Architecture IT354 @ Peter Lo 2005 1 Understand the advantages and disadvantages of distributed databases Know the design issues involved in distributed
More informationData, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
More informationData Mining in the Swamp
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
More informationRNA Structure and folding
RNA Structure and folding Overview: The main functional biomolecules in cells are polymers DNA, RNA and proteins For RNA and Proteins, the specific sequence of the polymer dictates its final structure
More informationORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process
ORACLE OLAP KEY FEATURES AND BENEFITS FAST ANSWERS TO TOUGH QUESTIONS EASILY KEY FEATURES & BENEFITS World class analytic engine Superior query performance Simple SQL access to advanced analytics Enhanced
More informationThe Sierra Clustered Database Engine, the technology at the heart of
A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel
More informationSGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD
White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper
More informationOLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP
Data Warehousing and End-User Access Tools OLAP and Data Mining Accompanying growth in data warehouses is increasing demands for more powerful access tools providing advanced analytical capabilities. Key
More informationConcepts of Database Management Seventh Edition. Chapter 9 Database Management Approaches
Concepts of Database Management Seventh Edition Chapter 9 Database Management Approaches Objectives Describe distributed database management systems (DDBMSs) Discuss client/server systems Examine the ways
More informationSELF-SERVICE ANALYTICS: SMART INTELLIGENCE WITH INFONEA IN A CONTINUUM BETWEEN INTERACTIVE REPORTS, ANALYTICS FOR BUSINESS USERS AND DATA SCIENCE
SELF-SERVICE BUSINESS INTELLIGENCE / INFONEA FEATURE OVERVIEW / SELF-SERVICE ANALYTICS: SMART INTELLIGENCE WITH INFONEA IN A CONTINUUM BETWEEN INTERACTIVE REPORTS, ANALYTICS FOR BUSINESS USERS AND DATA
More informationBLAST. Anders Gorm Pedersen & Rasmus Wernersson
BLAST Anders Gorm Pedersen & Rasmus Wernersson Database searching Using pairwise alignments to search databases for similar sequences Query sequence Database Database searching Most common use of pairwise
More informationInternational Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
More informationbigdata Managing Scale in Ontological Systems
Managing Scale in Ontological Systems 1 This presentation offers a brief look scale in ontological (semantic) systems, tradeoffs in expressivity and data scale, and both information and systems architectural
More information14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)
Overview Kyrre Glette kyrrehg@ifi INF3490 Swarm Intelligence Particle Swarm Optimization Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) 3 Swarms in nature Fish, birds,
More informationTertiary Storage and Data Mining queries
An Architecture for Using Tertiary Storage in a Data Warehouse Theodore Johnson Database Research Dept. AT&T Labs - Research johnsont@research.att.com Motivation AT&T has huge data warehouses. Data from
More informationPeopleTools Tables: The Application Repository in the Database
PeopleTools Tables: The Application Repository in the Database by David Kurtz, Go-Faster Consultancy Ltd. Since their takeover of PeopleSoft, Oracle has announced project Fusion, an initiative for a new
More informationREGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf])
820 REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf]) (See also General Regulations) BMS1 Admission to the Degree To be eligible for admission to the degree of Bachelor
More informationDelivering the power of the world s most successful genomics platform
Delivering the power of the world s most successful genomics platform NextCODE Health is bringing the full power of the world s largest and most successful genomics platform to everyday clinical care NextCODE
More informationAPP INVENTOR. Test Review
APP INVENTOR Test Review Main Concepts App Inventor Lists Creating Random Numbers Variables Searching and Sorting Data Linear Search Binary Search Selection Sort Quick Sort Abstraction Modulus Division
More informationIn-Database Analytics
Embedding Analytics in Decision Management Systems In-database analytics offer a powerful tool for embedding advanced analytics in a critical component of IT infrastructure. James Taylor CEO CONTENTS Introducing
More informationOracle Database 12c Plug In. Switch On. Get SMART.
Oracle Database 12c Plug In. Switch On. Get SMART. Duncan Harvey Head of Core Technology, Oracle EMEA March 2015 Safe Harbor Statement The following is intended to outline our general product direction.
More informationIn Memory Accelerator for MongoDB
In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000
More informationDISTRIBUTED AND PARALLELL DATABASE
DISTRIBUTED AND PARALLELL DATABASE SYSTEMS Tore Risch Uppsala Database Laboratory Department of Information Technology Uppsala University Sweden http://user.it.uu.se/~torer PAGE 1 What is a Distributed
More informationPreparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL
Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL Jasna S MTech Student TKM College of engineering Kollam Manu J Pillai Assistant Professor
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationTIM 50 - Business Information Systems
TIM 50 - Business Information Systems Lecture 15 UC Santa Cruz March 1, 2015 The Database Approach to Data Management Database: Collection of related files containing records on people, places, or things.
More informationReal Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA
Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,
More informationData Mining and Database Systems: Where is the Intersection?
Data Mining and Database Systems: Where is the Intersection? Surajit Chaudhuri Microsoft Research Email: surajitc@microsoft.com 1 Introduction The promise of decision support systems is to exploit enterprise
More informationREGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])
299 REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) (See also General Regulations) Any publication based on work approved for a higher degree should contain a reference
More informationx64 Servers: Do you want 64 or 32 bit apps with that server?
TMurgent Technologies x64 Servers: Do you want 64 or 32 bit apps with that server? White Paper by Tim Mangan TMurgent Technologies February, 2006 Introduction New servers based on what is generally called
More informationREGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])
244 REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) (See also General Regulations) Any publication based on work approved for a higher degree should contain a reference
More informationETPL Extract, Transform, Predict and Load
ETPL Extract, Transform, Predict and Load An Oracle White Paper March 2006 ETPL Extract, Transform, Predict and Load. Executive summary... 2 Why Extract, transform, predict and load?... 4 Basic requirements
More informationToad for Oracle 8.6 SQL Tuning
Quick User Guide for Toad for Oracle 8.6 SQL Tuning SQL Tuning Version 6.1.1 SQL Tuning definitively solves SQL bottlenecks through a unique methodology that scans code, without executing programs, to
More informationData processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
More informationObservations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications
Observations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications Roman Pfarrhofer and Andreas Uhl uhl@cosy.sbg.ac.at R. Pfarrhofer & A. Uhl 1 Carinthia Tech Institute
More informationCHAPTER 4 Data Warehouse Architecture