Data Mining Systems Development. Arno Siebes

Size: px
Start display at page:

Download "Data Mining Systems Development. Arno Siebes"

Transcription

1 Data Mining Systems Development Arno Siebes

2 A Variety of Systems Data mining systems development depends on why you develop the system in the first place: Mining Research Systems, e.g., for: - algorithmic research - systems research Applied Research, e.g., tools for bioinformatics Commercial use, e.g., - horizontal systems, i.e., a system that supports a wide variety of algorithms and (data mining) problems - vertical systems, i.e., a system that is completely targeted towards a specific (business!) problem These different applications, obviously, share requirements on the system but also have different requirements. Where they share requirements, the degrees and the priorities can be rather different. Let us list and contrast these requirements. Arno Siebes Data Mining Systems Development (page 2 of 27)

3 Algorithms The need for algorithms varies with the application: Many and easily extended pure research: extensibility is crucial for algorithmic research horizontal commercial systems: under the precondition of manageability. Specially tailored vertical commercial systems: often even so far as having parameters fixed in an automation phase. applied research: clearly, but extensibility is also important. Extensibility can be achieved in various ways: a monolithic system; it means lots of code redundancy and code management becomes cumbersome a plug and play approach using code from various sources and API s (the Kepler approach); a class library approach, only suitable for well-trained programmers, but allows for lots of code sharing (the MLC++ approach) a components based approach: decomposing algorithms in: representation language, search operators, search strategies and quality functions (the Keso approach). Arno Siebes Data Mining Systems Development (page 3 of 27)

4 Data Management: storage The data has to reside somewhere, it could be: a file system: algorithmic research a dbms within the system: note that the intended difference between a file system and a dbms is the presence of search accelerators (indices), special operators and the support of complex data structures. systems research: you want access to all aspects that govern scalability applied research: the data doesn t easily fit into standard (relational) dbmss. a commercial dbms: commercial systems: that is where the data resides! Note that data import facilities into an internal system may speed up processing considerably, but it may not be what your client actually wants. Arno Siebes Data Mining Systems Development (page 4 of 27)

5 KDD Process Support We all know that data analysis is more than just running algorithms on a data set, there is lots of pre and post processing: Data Preparation Data Selection Data Cleaning Data Transformations Experimental Set Up Random subset selection Cross validation Post Processing Result inspection (e.g., visualization) Result exportation Arno Siebes Data Mining Systems Development (page 5 of 27)

6 Data Preparation There are various ways in which one can prepare data: Unix Tools and Scripting Languages algorithmic research: all the flexibility you need systems research: well, that is if there are no interesting research questions involved SQL applied research: if they are well trained in such languages horizontal systems: if they are well trained in such languages Visual Tools in the Interface commercial systems: the other possibilities are simply too difficult for most users applied research and systems research: there are interesting research questions! Arno Siebes Data Mining Systems Development (page 6 of 27)

7 Experimental Set Up Who needs an experimental set up? You may think everyone, but that isn t true Who doesn t? Why? in vertical applications there may be a hidden experimental set up, but the user should not be bothered by it. the users are most probably simply not smart enough to know what they are doing the users won t have the resources to do this Does this mean you don t test? (Shudder!) not necessarily: the system may have been tuned beforehand, while the results are monitored by experts. Arno Siebes Data Mining Systems Development (page 7 of 27)

8 Post Processing Post processing needs vary: None: algorithmic research: the results are only used in the validation of the algorithm Result Inspection: systems research: interesting research questions, e.g., how to visualize sets of association rules applied research and commercial systems: decisions will be made on the basis of these results, thus thorough understanding is necessary Result Exportation: horizontal systems: at least some XML export facility vertical systems: seamless integration with other products. Arno Siebes Data Mining Systems Development (page 8 of 27)

9 Developing Systems: my experiences and plans In the remainder of this talk, I will discuss some of my experiences and plans: The Keso system: system development research some algorithmic research goal a (modest) horizontal system Data Surveyor at Data Distilleries parallel with the Keso development original goal: a horizontal system products: vertical systems for analytical CRM The future: applied research: bioinformatics system development research to support this work Arno Siebes Data Mining Systems Development (page 9 of 27)

10 The Keso System The Keso system is a client server system: Client: Server user interface mining engine mining server (dbms) caching facilities We ll discuss briefly the mining engine and the mining server aspects of Keso Arno Siebes Data Mining Systems Development (page 10 of 27)

11 The Mining Engine One of the aspects of the KESO data mining kernel architecture is that each data mining algorithm consists of four components: 1. A model representation language. 2. A (local) quality function: which of these models fit the database best? 3. A search strategy: exhaustive or heuristics driven search for good/best models 4. Search operators: that define how a search strategy goes through the space of all possible models Arno Siebes Data Mining Systems Development (page 11 of 27)

12 The Implementation In the architecture this is reflected as follows: Search Man. Search Space Maintainer Qual. C. Model Gen Data Base Arno Siebes Data Mining Systems Development (page 12 of 27)

13 Components Search Manager: Contains a number of Search Modules. Each implements a Search Strategy Description Generator: Contains a number of operator modules. Each module implements an operator on a specific description language. Quality Computer: Implements a number of quality function modules and has a separate module to query the database for cross-tables. Arno Siebes Data Mining Systems Development (page 13 of 27)

14 Search Space Manager: All communication between the components is through records in the Search Space Manager. Some fields in these records are: An oid (logical name of ϕ) The description (ϕ itself) The quality of ϕ The operator o with which ϕ is constructed The parent model, i.e., ψ, with o(ψ) = ϕ While the search space is explored, these records are stored in a database. Arno Siebes Data Mining Systems Development (page 14 of 27)

15 Lessons Learned: The Search Space Manager Why did we decide to store the explored part of a search space? it allows users to see the route that the (heuristic) search has taken and explore other avenues from arbitrary points in that space it can be used as a caching mechanism, for points that you already visited you don t have to repeat the quality computation. Caching: it is a nice idea, but not that practical. The search space grows quickly. Maintaining a good index structure is expensive and linear search becomes expensive quickly as well Conclusion: it is better to speed-up the quality computation (i.e., database access). Exploring the Search Space: this is a good idea, but requires good user interface support. Experience learns that it is not necessary (not to say an overkill) to store all points, it is far better just to store those points that were considered interesting during the search and do some recomputations if necessary. Arno Siebes Data Mining Systems Development (page 15 of 27)

16 Lessons Learned: The Components Approach Why did we choose for a components based approach? extensibility through component sharing localised database access (quality computation) Extensibility: search strategies: clearly true operators and quality functions: depend strongly on the modelling language. Flexibility within one language is good. Localised database access: good results, especially because of uniform database access (through data cubes). This allows for all kinds of optimization. Arno Siebes Data Mining Systems Development (page 16 of 27)

17 The Mining Server At CWI we have a reasonably powerfull SGI Origin 2000 SMP machine: /300 MHz MIPS and 64GB main memory. It is nice to exploit such hardware fro the mining server. The possibilities lie in a few aspects of the Keso system: KESO use of the main memory DBMS Monet (a CWI/ UvA research prototype) The fact that the search of many data mining algorithms has a zooming type of behaviour The fact that quality functions are based on relatively simple aggregates, such as count and sum The fact that these aggregate queries are submitted in batches of related queries. Arno Siebes Data Mining Systems Development (page 17 of 27)

18 Monet Monet is a main memory DBMS, i.e., it is assumed that the database hotspot will always fit in main memory. An important aspect for us is that the data is stored as columns (per Attribute) rather than as rows (per tuple) as is customary. Each attribute is stored in a BAT (binary association table) as oid-value pairs. Data mining algorithms tend to investigate mostly one and at most a few attributes at the same time. In the case of the classification trees, one considers the qualities of the splits per attribute: this ensures that the hotspot fits it allows parallelization: - distribute the attributes over the processors - distribute each attribute over the processors Arno Siebes Data Mining Systems Development (page 18 of 27)

19 Zooming Consider again the construction of a classification tree. Each node n in the tree describes a subset s n of the database: If n is a child of m, s n is a subset of s m In other words, if we store the select set of the parent, we only have to scan that part of the database for the evaluation of the candidate sons: this requires the extra storage of one column only the optimization in time is exponential in the depth of the tree. Similar observations hold for many other mining algorithms. Arno Siebes Data Mining Systems Development (page 19 of 27)

20 The Aggregates In a client-server architecture (such as KESO), it is beneficial if the aggregates on which the quality functions are based are computed in the server rather than in the client. From a parallelization point of view, these aggregates have the nice property that they distribute: the count of the whole select set is the sum of the counts of a partition of the select set the sum is the sum of the sums Arno Siebes Data Mining Systems Development (page 20 of 27)

21 Multi-query optimization Traditionally, a DBMS optimizes each query posed to it separately. It does not take other queries in the system into account. While constructing a classification tree, we consider all possible extensions at the same time. This leads to a batch of strongly related aggregate queries to the database. Optimizing the combined set of queries (which is a.o., possible because of the distributive nature of the aggregates) leads to far better response times than the set of optimized queries. To do this, a special optimizer (called the MIL squeezer) has been implemented in Monet. Note, normal query optimization is still very much part of the process, of course. Arno Siebes Data Mining Systems Development (page 21 of 27)

22 Lessons learned Using our own DBMS was a major asset in the work: all unnecessary components such as locking and transaction amangement in general could be left out. operators such as the datacube could be implemented directly in the core of the DBMS all optimization tricks could be built in directly. all of this results in far greater speed than commercial platforms such as Oracle can deliver. Moreover, it means that new data structures etc, are easily supported. In particular, joint work with Arno Knobbe (syllogic) shows that the whole framework can be easily generalized for multi-relational mining. If you want a drawback: stability. Arno Siebes Data Mining Systems Development (page 22 of 27)

23 Experiences at DD As noted before, the development at Data Distilleries went in parallel with the development of the KESO system (in which they were a part of the team together with UH and GMD) Their technical experiences are in part what I ve told you. But there are some interesting business experiences: User Interface: the importance of this aspect cannot be overestimated. The interface should support the whole process in an intuitive fashion. E.g., SQL is far too difficult for many users. Server: exploiting an internal DBMS is ok, but you should import data from a wide variety of sources. speed is not that important, other aspects dominate Extensibility: Most clients are interested in solving business problems, i.e., require vertical systems. It is not the range of algorithms but the suitability of the algorithm for that purpose that counts. This also means that exporting results has to integrate with other products. Arno Siebes Data Mining Systems Development (page 23 of 27)

24 The Future I have learned a lot from the development of the Keso system, and I will start developing a new one at Utrecht University. again we will use Monet again we will think components based But, there is also going to be a difference: it is going to be an applied research vehicle for bioinformatics (in particular for genomics and proteomics) That is, it is going to be biologists workbench: including a dbms for biological data supporting mining of biological data Arno Siebes Data Mining Systems Development (page 24 of 27)

25 Biomolecular Data Biomolecular data is far from straight forward, some main reasons are: During inheritance DNA sequences are changed (both through cross-over and mutation). only parts of DNA strings encode for genes that may be turned on or off or even be silenced. The rest is called junk DNA... from the transcribed RNA large parts are introns that are subsequently removed only part of the mrna string is translated into protein because of the redundancy of the genetic code, different mrna substrings may yield the same protein. subtly different protein molecules may act completely interchangeable, because large parts of the protein simply guarantee the required geometry of the more active parts of the molecule. Arno Siebes Data Mining Systems Development (page 25 of 27)

26 Searching Because of this approximate nature of the strings, searching databases is not: the standard database equality search not even exact string matching It is an alignment problem with penalty functions for the errors There are a number of algorithms for this problem: using on dynamic programming: find the best (partial) alignment between two strings using Hidden Markov Models: how likely is one string produced by a machine that could have produced another string? Moreover, such algorithms have been transfered to a database setting in programs as BLAST and FASTA However, these tricks are far from the regular database approach, i.e., indices. One of the goals will be to built index structures in Monet targeted towards alignment searches. Arno Siebes Data Mining Systems Development (page 26 of 27)

27 Mining Biomolecular data is a paradise for data miners. There is not enough theory to attack the (enormous amounts of) data in a traditional way. Examples of questions are: discovering genes (there are approaches using HMMs, but more traditional techniques such as classification trees are also used). building philo-genetic trees (i.e, tracing back evolution) function prediction for proteins uncovering metabolic pathways Arno Siebes Data Mining Systems Development (page 27 of 27)

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia

More information

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: Tuesday 11:00-12:00/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office

More information

Topics in basic DBMS course

Topics in basic DBMS course Topics in basic DBMS course Database design Transaction processing Relational query languages (SQL), calculus, and algebra DBMS APIs Database tuning (physical database design) Basic query processing (ch

More information

Bioinformatics Grid - Enabled Tools For Biologists.

Bioinformatics Grid - Enabled Tools For Biologists. Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Bioinformatics Resources at a Glance

Bioinformatics Resources at a Glance Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences

More information

SQL Server 2005 Features Comparison

SQL Server 2005 Features Comparison Page 1 of 10 Quick Links Home Worldwide Search Microsoft.com for: Go : Home Product Information How to Buy Editions Learning Downloads Support Partners Technologies Solutions Community Previous Versions

More information

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) 305 REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) (See also General Regulations) Any publication based on work approved for a higher degree should contain a reference

More information

1 File Processing Systems

1 File Processing Systems COMP 378 Database Systems Notes for Chapter 1 of Database System Concepts Introduction A database management system (DBMS) is a collection of data and an integrated set of programs that access that data.

More information

A Non-Linear Schema Theorem for Genetic Algorithms

A Non-Linear Schema Theorem for Genetic Algorithms A Non-Linear Schema Theorem for Genetic Algorithms William A Greene Computer Science Department University of New Orleans New Orleans, LA 70148 bill@csunoedu 504-280-6755 Abstract We generalize Holland

More information

Final Project Report

Final Project Report CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association

More information

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006 Hidden Markov Models in Bioinformatics By Máthé Zoltán Kőrösi Zoltán 2006 Outline Markov Chain HMM (Hidden Markov Model) Hidden Markov Models in Bioinformatics Gene Finding Gene Finding Model Viterbi algorithm

More information

Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation

Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation Objectives Distributed Databases and Client/Server Architecture IT354 @ Peter Lo 2005 1 Understand the advantages and disadvantages of distributed databases Know the design issues involved in distributed

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

RNA Structure and folding

RNA Structure and folding RNA Structure and folding Overview: The main functional biomolecules in cells are polymers DNA, RNA and proteins For RNA and Proteins, the specific sequence of the polymer dictates its final structure

More information

ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process

ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process ORACLE OLAP KEY FEATURES AND BENEFITS FAST ANSWERS TO TOUGH QUESTIONS EASILY KEY FEATURES & BENEFITS World class analytic engine Superior query performance Simple SQL access to advanced analytics Enhanced

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper

More information

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP Data Warehousing and End-User Access Tools OLAP and Data Mining Accompanying growth in data warehouses is increasing demands for more powerful access tools providing advanced analytical capabilities. Key

More information

Concepts of Database Management Seventh Edition. Chapter 9 Database Management Approaches

Concepts of Database Management Seventh Edition. Chapter 9 Database Management Approaches Concepts of Database Management Seventh Edition Chapter 9 Database Management Approaches Objectives Describe distributed database management systems (DDBMSs) Discuss client/server systems Examine the ways

More information

SELF-SERVICE ANALYTICS: SMART INTELLIGENCE WITH INFONEA IN A CONTINUUM BETWEEN INTERACTIVE REPORTS, ANALYTICS FOR BUSINESS USERS AND DATA SCIENCE

SELF-SERVICE ANALYTICS: SMART INTELLIGENCE WITH INFONEA IN A CONTINUUM BETWEEN INTERACTIVE REPORTS, ANALYTICS FOR BUSINESS USERS AND DATA SCIENCE SELF-SERVICE BUSINESS INTELLIGENCE / INFONEA FEATURE OVERVIEW / SELF-SERVICE ANALYTICS: SMART INTELLIGENCE WITH INFONEA IN A CONTINUUM BETWEEN INTERACTIVE REPORTS, ANALYTICS FOR BUSINESS USERS AND DATA

More information

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

BLAST. Anders Gorm Pedersen & Rasmus Wernersson BLAST Anders Gorm Pedersen & Rasmus Wernersson Database searching Using pairwise alignments to search databases for similar sequences Query sequence Database Database searching Most common use of pairwise

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

bigdata Managing Scale in Ontological Systems

bigdata Managing Scale in Ontological Systems Managing Scale in Ontological Systems 1 This presentation offers a brief look scale in ontological (semantic) systems, tradeoffs in expressivity and data scale, and both information and systems architectural

More information

14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)

14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) Overview Kyrre Glette kyrrehg@ifi INF3490 Swarm Intelligence Particle Swarm Optimization Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) 3 Swarms in nature Fish, birds,

More information

Tertiary Storage and Data Mining queries

Tertiary Storage and Data Mining queries An Architecture for Using Tertiary Storage in a Data Warehouse Theodore Johnson Database Research Dept. AT&T Labs - Research johnsont@research.att.com Motivation AT&T has huge data warehouses. Data from

More information

PeopleTools Tables: The Application Repository in the Database

PeopleTools Tables: The Application Repository in the Database PeopleTools Tables: The Application Repository in the Database by David Kurtz, Go-Faster Consultancy Ltd. Since their takeover of PeopleSoft, Oracle has announced project Fusion, an initiative for a new

More information

REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf])

REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf]) 820 REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf]) (See also General Regulations) BMS1 Admission to the Degree To be eligible for admission to the degree of Bachelor

More information

Delivering the power of the world s most successful genomics platform

Delivering the power of the world s most successful genomics platform Delivering the power of the world s most successful genomics platform NextCODE Health is bringing the full power of the world s largest and most successful genomics platform to everyday clinical care NextCODE

More information

APP INVENTOR. Test Review

APP INVENTOR. Test Review APP INVENTOR Test Review Main Concepts App Inventor Lists Creating Random Numbers Variables Searching and Sorting Data Linear Search Binary Search Selection Sort Quick Sort Abstraction Modulus Division

More information

In-Database Analytics

In-Database Analytics Embedding Analytics in Decision Management Systems In-database analytics offer a powerful tool for embedding advanced analytics in a critical component of IT infrastructure. James Taylor CEO CONTENTS Introducing

More information

Oracle Database 12c Plug In. Switch On. Get SMART.

Oracle Database 12c Plug In. Switch On. Get SMART. Oracle Database 12c Plug In. Switch On. Get SMART. Duncan Harvey Head of Core Technology, Oracle EMEA March 2015 Safe Harbor Statement The following is intended to outline our general product direction.

More information

In Memory Accelerator for MongoDB

In Memory Accelerator for MongoDB In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000

More information

DISTRIBUTED AND PARALLELL DATABASE

DISTRIBUTED AND PARALLELL DATABASE DISTRIBUTED AND PARALLELL DATABASE SYSTEMS Tore Risch Uppsala Database Laboratory Department of Information Technology Uppsala University Sweden http://user.it.uu.se/~torer PAGE 1 What is a Distributed

More information

Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL

Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL Jasna S MTech Student TKM College of engineering Kollam Manu J Pillai Assistant Professor

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

TIM 50 - Business Information Systems

TIM 50 - Business Information Systems TIM 50 - Business Information Systems Lecture 15 UC Santa Cruz March 1, 2015 The Database Approach to Data Management Database: Collection of related files containing records on people, places, or things.

More information

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,

More information

Data Mining and Database Systems: Where is the Intersection?

Data Mining and Database Systems: Where is the Intersection? Data Mining and Database Systems: Where is the Intersection? Surajit Chaudhuri Microsoft Research Email: surajitc@microsoft.com 1 Introduction The promise of decision support systems is to exploit enterprise

More information

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) 299 REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) (See also General Regulations) Any publication based on work approved for a higher degree should contain a reference

More information

x64 Servers: Do you want 64 or 32 bit apps with that server?

x64 Servers: Do you want 64 or 32 bit apps with that server? TMurgent Technologies x64 Servers: Do you want 64 or 32 bit apps with that server? White Paper by Tim Mangan TMurgent Technologies February, 2006 Introduction New servers based on what is generally called

More information

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) 244 REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) (See also General Regulations) Any publication based on work approved for a higher degree should contain a reference

More information

ETPL Extract, Transform, Predict and Load

ETPL Extract, Transform, Predict and Load ETPL Extract, Transform, Predict and Load An Oracle White Paper March 2006 ETPL Extract, Transform, Predict and Load. Executive summary... 2 Why Extract, transform, predict and load?... 4 Basic requirements

More information

Toad for Oracle 8.6 SQL Tuning

Toad for Oracle 8.6 SQL Tuning Quick User Guide for Toad for Oracle 8.6 SQL Tuning SQL Tuning Version 6.1.1 SQL Tuning definitively solves SQL bottlenecks through a unique methodology that scans code, without executing programs, to

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Observations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications

Observations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications Observations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications Roman Pfarrhofer and Andreas Uhl uhl@cosy.sbg.ac.at R. Pfarrhofer & A. Uhl 1 Carinthia Tech Institute

More information

CHAPTER 4 Data Warehouse Architecture

CHAPTER 4 Data Warehouse Architecture CHAPTER 4 Data Warehouse Architecture 4.1 Data Warehouse Architecture 4.2 Three-tier data warehouse architecture 4.3 Types of OLAP servers: ROLAP versus MOLAP versus HOLAP 4.4 Further development of Data

More information

How to Design and Create Your Own Custom Ext Rep

How to Design and Create Your Own Custom Ext Rep Combinatorial Block Designs 2009-04-15 Outline Project Intro External Representation Design Database System Deployment System Overview Conclusions 1. Since the project is a specific application in Combinatorial

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis

Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis Rajesh Reddy Muley 1, Sravani Achanta 2, Prof.S.V.Achutha Rao 3 1 pursuing M.Tech(CSE), Vikas College of Engineering and

More information

Compact Representations and Approximations for Compuation in Games

Compact Representations and Approximations for Compuation in Games Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Visionet IT Modernization Empowering Change

Visionet IT Modernization Empowering Change Visionet IT Modernization A Visionet Systems White Paper September 2009 Visionet Systems Inc. 3 Cedar Brook Dr. Cranbury, NJ 08512 Tel: 609 360-0501 Table of Contents 1 Executive Summary... 4 2 Introduction...

More information

Using In-Memory Computing to Simplify Big Data Analytics

Using In-Memory Computing to Simplify Big Data Analytics SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Netezza and Business Analytics Synergy

Netezza and Business Analytics Synergy Netezza Business Partner Update: November 17, 2011 Netezza and Business Analytics Synergy Shimon Nir, IBM Agenda Business Analytics / Netezza Synergy Overview Netezza overview Enabling the Business with

More information

PARALLELS CLOUD STORAGE

PARALLELS CLOUD STORAGE PARALLELS CLOUD STORAGE Performance Benchmark Results 1 Table of Contents Executive Summary... Error! Bookmark not defined. Architecture Overview... 3 Key Features... 5 No Special Hardware Requirements...

More information

Web-Based Genomic Information Integration with Gene Ontology

Web-Based Genomic Information Integration with Gene Ontology Web-Based Genomic Information Integration with Gene Ontology Kai Xu 1 IMAGEN group, National ICT Australia, Sydney, Australia, kai.xu@nicta.com.au Abstract. Despite the dramatic growth of online genomic

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

First, we ll look at some basics all too often the things you cannot change easily!

First, we ll look at some basics all too often the things you cannot change easily! Basic Performance Tips Purpose This document is inted to be a living document, updated often, with thoughts, tips and tricks related to getting maximum performance when using Tableau Desktop. The reader

More information

SQL Server Administrator Introduction - 3 Days Objectives

SQL Server Administrator Introduction - 3 Days Objectives SQL Server Administrator Introduction - 3 Days INTRODUCTION TO MICROSOFT SQL SERVER Exploring the components of SQL Server Identifying SQL Server administration tasks INSTALLING SQL SERVER Identifying

More information

1. INTRODUCTION TO RDBMS

1. INTRODUCTION TO RDBMS Oracle For Beginners Page: 1 1. INTRODUCTION TO RDBMS What is DBMS? Data Models Relational database management system (RDBMS) Relational Algebra Structured query language (SQL) What Is DBMS? Data is one

More information

2 SYSTEM DESCRIPTION TECHNIQUES

2 SYSTEM DESCRIPTION TECHNIQUES 2 SYSTEM DESCRIPTION TECHNIQUES 2.1 INTRODUCTION Graphical representation of any process is always better and more meaningful than its representation in words. Moreover, it is very difficult to arrange

More information

Base One's Rich Client Architecture

Base One's Rich Client Architecture Base One's Rich Client Architecture Base One provides a unique approach for developing Internet-enabled applications, combining both efficiency and ease of programming through its "Rich Client" architecture.

More information

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc. Oracle9i Data Warehouse Review Robert F. Edwards Dulcian, Inc. Agenda Oracle9i Server OLAP Server Analytical SQL Data Mining ETL Warehouse Builder 3i Oracle 9i Server Overview 9i Server = Data Warehouse

More information

White Paper. Optimizing the Performance Of MySQL Cluster

White Paper. Optimizing the Performance Of MySQL Cluster White Paper Optimizing the Performance Of MySQL Cluster Table of Contents Introduction and Background Information... 2 Optimal Applications for MySQL Cluster... 3 Identifying the Performance Issues.....

More information

PostgreSQL Business Intelligence & Performance Simon Riggs CTO, 2ndQuadrant PostgreSQL Major Contributor

PostgreSQL Business Intelligence & Performance Simon Riggs CTO, 2ndQuadrant PostgreSQL Major Contributor PostgreSQL Business Intelligence & Performance Simon Riggs CTO, 2ndQuadrant PostgreSQL Major Contributor The research leading to these results has received funding from the European Union's Seventh Framework

More information

Clustering & Visualization

Clustering & Visualization Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Big Data, Fast Data, Complex Data. Jans Aasman Franz Inc

Big Data, Fast Data, Complex Data. Jans Aasman Franz Inc Big Data, Fast Data, Complex Data Jans Aasman Franz Inc Private, founded 1984 AI, Semantic Technology, professional services Now in Oakland Franz Inc Who We Are (1 (2 3) (4 5) (6 7) (8 9) (10 11) (12

More information

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the

More information

Oracle Rdb Performance Management Guide

Oracle Rdb Performance Management Guide Oracle Rdb Performance Management Guide Solving the Five Most Common Problems with Rdb Application Performance and Availability White Paper ALI Database Consultants 803-648-5931 www.aliconsultants.com

More information

Mauro Sousa Marta Mattoso Nelson Ebecken. and these techniques often repeatedly scan the. entire set. A solution that has been used for a

Mauro Sousa Marta Mattoso Nelson Ebecken. and these techniques often repeatedly scan the. entire set. A solution that has been used for a Data Mining on Parallel Database Systems Mauro Sousa Marta Mattoso Nelson Ebecken COPPEèUFRJ - Federal University of Rio de Janeiro P.O. Box 68511, Rio de Janeiro, RJ, Brazil, 21945-970 Fax: +55 21 2906626

More information

Big Data Mining Services and Knowledge Discovery Applications on Clouds

Big Data Mining Services and Knowledge Discovery Applications on Clouds Big Data Mining Services and Knowledge Discovery Applications on Clouds Domenico Talia DIMES, Università della Calabria & DtoK Lab Italy talia@dimes.unical.it Data Availability or Data Deluge? Some decades

More information

Aerospace Software Engineering

Aerospace Software Engineering 16.35 Aerospace Software Engineering Software Architecture The 4+1 view Patterns Prof. Kristina Lundqvist Dept. of Aero/Astro, MIT Why Care About Software Architecture? An architecture provides a vehicle

More information

Persistent Data Structures

Persistent Data Structures 6.854 Advanced Algorithms Lecture 2: September 9, 2005 Scribes: Sommer Gentry, Eddie Kohler Lecturer: David Karger Persistent Data Structures 2.1 Introduction and motivation So far, we ve seen only ephemeral

More information

The basic data mining algorithms introduced may be enhanced in a number of ways.

The basic data mining algorithms introduced may be enhanced in a number of ways. DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,

More information

Introduction to SQL for Data Scientists

Introduction to SQL for Data Scientists Introduction to SQL for Data Scientists Ben O. Smith College of Business Administration University of Nebraska at Omaha Learning Objectives By the end of this document you will learn: 1. How to perform

More information

Why compute in parallel? Cloud computing. Big Data 11/29/15. Introduction to Data Management CSE 344. Science is Facing a Data Deluge!

Why compute in parallel? Cloud computing. Big Data 11/29/15. Introduction to Data Management CSE 344. Science is Facing a Data Deluge! Why compute in parallel? Introduction to Data Management CSE 344 Lectures 23 and 24 Parallel Databases Most processors have multiple cores Can run multiple jobs simultaneously Natural extension of txn

More information

Key Requirements for a Job Scheduling and Workload Automation Solution

Key Requirements for a Job Scheduling and Workload Automation Solution Key Requirements for a Job Scheduling and Workload Automation Solution Traditional batch job scheduling isn t enough. Short Guide Overcoming Today s Job Scheduling Challenges While traditional batch job

More information

FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY AUTUMN 2016 BACHELOR COURSES

FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY AUTUMN 2016 BACHELOR COURSES FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY Please note! This is a preliminary list of courses for the study year 2016/2017. Changes may occur! AUTUMN 2016 BACHELOR COURSES DIP217 Applied Software

More information

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

NoSQL Database Options

NoSQL Database Options NoSQL Database Options Introduction For this report, I chose to look at MongoDB, Cassandra, and Riak. I chose MongoDB because it is quite commonly used in the industry. I chose Cassandra because it has

More information

Design and Implementation of the Heterogeneous Multikernel Operating System

Design and Implementation of the Heterogeneous Multikernel Operating System 223 Design and Implementation of the Heterogeneous Multikernel Operating System Yauhen KLIMIANKOU Department of Computer Systems and Networks, Belarusian State University of Informatics and Radioelectronics,

More information

Introduction Predictive Analytics Tools: Weka

Introduction Predictive Analytics Tools: Weka Introduction Predictive Analytics Tools: Weka Predictive Analytics Center of Excellence San Diego Supercomputer Center University of California, San Diego Tools Landscape Considerations Scale User Interface

More information

LDIF - Linked Data Integration Framework

LDIF - Linked Data Integration Framework LDIF - Linked Data Integration Framework Andreas Schultz 1, Andrea Matteini 2, Robert Isele 1, Christian Bizer 1, and Christian Becker 2 1. Web-based Systems Group, Freie Universität Berlin, Germany a.schultz@fu-berlin.de,

More information

Table of Contents. June 2010

Table of Contents. June 2010 June 2010 From: StatSoft Analytics White Papers To: Internal release Re: Performance comparison of STATISTICA Version 9 on multi-core 64-bit machines with current 64-bit releases of SAS (Version 9.2) and

More information

Enterprise Performance Tuning: Best Practices with SQL Server 2008 Analysis Services. By Ajay Goyal Consultant Scalability Experts, Inc.

Enterprise Performance Tuning: Best Practices with SQL Server 2008 Analysis Services. By Ajay Goyal Consultant Scalability Experts, Inc. Enterprise Performance Tuning: Best Practices with SQL Server 2008 Analysis Services By Ajay Goyal Consultant Scalability Experts, Inc. June 2009 Recommendations presented in this document should be thoroughly

More information

Base Conversion written by Cathy Saxton

Base Conversion written by Cathy Saxton Base Conversion written by Cathy Saxton 1. Base 10 In base 10, the digits, from right to left, specify the 1 s, 10 s, 100 s, 1000 s, etc. These are powers of 10 (10 x ): 10 0 = 1, 10 1 = 10, 10 2 = 100,

More information

Big Data Analytics with IBM Cognos BI Dynamic Query IBM Redbooks Solution Guide

Big Data Analytics with IBM Cognos BI Dynamic Query IBM Redbooks Solution Guide Big Data Analytics with IBM Cognos BI Dynamic Query IBM Redbooks Solution Guide IBM Cognos Business Intelligence (BI) helps you make better and smarter business decisions faster. Advanced visualization

More information

VIBE. Visual Integrated Bioinformatics Environment. Enter the Visual Age of Computational Genomics. Whitepaper

VIBE. Visual Integrated Bioinformatics Environment. Enter the Visual Age of Computational Genomics. Whitepaper VIBE Visual Integrated Bioinformatics Environment Whitepaper Enter the Visual Age of Computational Genomics INCOGEN, Inc. 104 George Perry Williamsburg, VA 23185 www.incogen.com Phone: 757-221-0550 info@incogen.com

More information

Smarter Balanced Assessment Consortium. Recommendation

Smarter Balanced Assessment Consortium. Recommendation Smarter Balanced Assessment Consortium Recommendation Smarter Balanced Quality Assurance Approach Recommendation for the Smarter Balanced Assessment Consortium 20 July 2012 Summary When this document was

More information

What is Data Virtualization? Rick F. van der Lans, R20/Consultancy

What is Data Virtualization? Rick F. van der Lans, R20/Consultancy What is Data Virtualization? by Rick F. van der Lans, R20/Consultancy August 2011 Introduction Data virtualization is receiving more and more attention in the IT industry, especially from those interested

More information

In-Memory Computing for Iterative CPU-intensive Calculations in Financial Industry In-Memory Computing Summit 2015

In-Memory Computing for Iterative CPU-intensive Calculations in Financial Industry In-Memory Computing Summit 2015 In-Memory Computing for Iterative CPU-intensive Calculations in Financial Industry In-Memory Computing Summit 2015 June 29-30, 2015 Contacts Alexandre Boudnik Senior Solution Architect, EPAM Systems Alexandre_Boudnik@epam.com

More information

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &

More information

Guide for Bioinformatics Project Module 3

Guide for Bioinformatics Project Module 3 Structure- Based Evidence and Multiple Sequence Alignment In this module we will revisit some topics we started to look at while performing our BLAST search and looking at the CDD database in the first

More information

Client/Server Computing Distributed Processing, Client/Server, and Clusters

Client/Server Computing Distributed Processing, Client/Server, and Clusters Client/Server Computing Distributed Processing, Client/Server, and Clusters Chapter 13 Client machines are generally single-user PCs or workstations that provide a highly userfriendly interface to the

More information