AN SQL EXTENSION FOR LATENT SEMANTIC ANALYSIS
Advances in Information Mining, Vol. 3, Issue 1, 2011

CRISTIAN BISCONTI 1, ANGELO CORALLO 1, HOSSAM FARIS *, SALVATORE TOTARO 1
1 Incubatore Euro-Mediterraneo in e-Business Management, University of Salento, Lecce, Italy
* Corresponding Author: [email protected]
Received: April 14, 2011; Accepted: May 12, 2011

Abstract- Latent Semantic Analysis (LSA) is a powerful statistical theory and method for indexing, retrieving and analyzing textual information. With the increased demand for more efficient databases and computing power, much research has focused on the practical side of LSA. In this paper we present the design and implementation of an SQL extension for LSA applications. The extension is intended to serve as an easy and simple interface to basic LSA and cosine-similarity functions for indexing, searching and retrieving large collections of textual information stored in an open source relational database management system such as PostgreSQL.

Key words- Latent semantic indexing, SQL, Information retrieval

Introduction
Information retrieval and the integration of its tools into relational database management systems have long been debated within the scientific community. Many researchers have identified a number of application fields for such an integration, and accordingly many different approaches and techniques have been proposed to address these issues [1,2]. The increasing amount of available unstructured textual information has motivated researchers to develop tools and design methodologies that improve the efficiency of information retrieval systems and automate them [3,4]. Almost all existing DBMSs now include a full-text search system, but lexical matching methods can be inaccurate when used to match a user's query.
When users search for information, documents that are potentially relevant may be missed and excluded from the retrieved set only because the users chose different words with nearly identical or similar meanings (synonyms). Naturally, it would be better for users to retrieve information on the basis of the conceptual topic or meaning of a document rather than by exact matching. The LSA technique [5] tries to overcome these problems by using a statistically based approach to indexing documents. LSA is a statistical information retrieval method that handles words with similar meanings, i.e. synonyms. LSI is capable of retrieving documents based on the concepts they contain by using statistically derived conceptual indices instead of individual words. Thus, search and retrieval systems based on LSI techniques can return a set of documents that are semantically similar to the search query even if the results do not share specific words with the query. LSA constructs a term-document matrix to record the occurrence of unique terms within a collection of documents: each term is represented by a row, each document by a column, and each cell (the intersection of a row and a column) holds the frequency of the given term in the given document. LSI then applies local and global weight functions to the terms, based on term frequencies, to reflect the relative importance of each term in each document of the collection; in other words, the term frequency determines the weight of the term in a given document. In this way a document is represented by a set of terms with their frequencies, ignoring the exact order of the terms in the document.
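To make this concrete, here is a minimal sketch in Python of building such a term-document matrix from raw text; the two one-sentence documents are hypothetical examples, not data from the paper:

```python
from collections import Counter

def term_document_matrix(documents):
    """Build a raw term-document frequency matrix.

    Rows are unique terms (sorted), columns are documents; each cell
    holds the frequency of the term in that document.
    """
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set(term for c in counts for term in c))
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix

docs = ["the cat sat on the mat", "the dog sat"]
vocab, tdm = term_document_matrix(docs)
# vocab is ['cat', 'dog', 'mat', 'on', 'sat', 'the'];
# the row for 'the' is [2, 1]: twice in the first document, once in the second
```

A real implementation would additionally apply the local and global weighting functions mentioned above (e.g. log-entropy or tf-idf) to the raw frequencies before decomposition.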
LSA then performs a particular mathematical technique called Singular Value Decomposition (SVD) on the matrix to identify the structure in term usage and recognize patterns in the relationships between the terms used across the documents. SVD is a mathematical mapping technique that reduces the number of dimensions of a document's vector, from the term space of the matrix into a low-dimensional space, making it more usable and efficient. As mentioned above, LSA recovers the underlying semantic structure of the concept space and accordingly improves retrieval performance with respect to the synonymy problem [6]. Indeed, traditional retrieval strategies are not able to discover documents about the same topic that use a different vocabulary. In LSA, the same concepts are all likely to be represented by a similar weighted combination of indexing variables. Moreover, while a large number of polysemous words in a query can reduce the precision of a search significantly, the reduced representation used in LSA makes it possible to remove some of this noise from the data. The applications of LSA range from information retrieval to relevance feedback and information filtering.
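A sketch of this reduction (assuming NumPy is available; the 4-term x 3-document matrix is a hypothetical example): the rank-k truncation keeps only the k largest singular values, and documents can then be compared by cosine similarity in the reduced space.

```python
import numpy as np

# Hypothetical 4-term x 3-document frequency matrix (rows: terms, columns: documents).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [3.0, 2.0, 0.0],
              [1.0, 0.0, 2.0]])

# Thin SVD: A = U @ diag(S) @ Vt, with singular values in descending order.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation: keep only the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# In the reduced space, document j is represented by the k-dimensional
# column j of diag(S_k) @ Vt_k; semantically related documents end up
# with a high cosine similarity between these columns.
docs_reduced = np.diag(S[:k]) @ Vt[:k, :]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```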
The LSA technique has also been applied to problems such as word sense disambiguation [7,8], cross-language retrieval [9,10] and the development of semiautomatic tools for document classification [11]. These considerations motivate integrating LSA into a DBMS [12] in order to provide an easy-to-access and simple-to-use SQL extension for LSA applications over large databases. This paper presents the early results of this integration, specifically the attempt to integrate the main LSA functions as an SQL extension in PostgreSQL. This kind of integration makes the LSA technique available as an autonomous, additional means for information retrieval through simple SQL commands. Since full-text search functionality is already available in the PostgreSQL DBMS, in the context of this experimentation the two can be used together. This work is divided into three parts. The first part introduces the new data types designed in this work for programming the LSA functions. The core implementation of the LSA functions is described in detail in the second part. Finally, the third part discusses a running example that uses the implemented LSA functions within SQL queries processed over sample data stored in an open source DBMS. The purpose of this example is to show the usefulness, power and ease of use of these functions. Moreover, it shows how simply the LSA techniques introduced in this paper can be applied to textual information stored in DBMSs without the need for third-party software. The extension was developed and implemented in the PostgreSQL DBMS, chosen for its level of adoption in the research and scientific communities and for its support of several important programming languages such as C.
Latent semantic data types
The integration of the LSA technique into the SQL language as an extension requires defining new data types beyond the primitive data types already present in PostgreSQL. These data types are designed to store all the information about a given document that is needed for LSA. In particular, for each document we need to know the number of lexemes, their position within the document and their length. In the presented case, this information is embedded in a structure written in the C language; the lexemes are stored in a memory block contiguous with the structure.

A. LSAVector data type
The new data type LSAVector is designed to store all the information related to a given document belonging to a specific Knowledge Base (KB). Basically, an LSAVector is an ordered map that associates a value, the number of occurrences of a specific word in a specific document, with a key, the word being considered. For example, if d1 is a document containing only the terms alfa, beta, gamma and delta, then the LSAVector s1 associated with d1 is a map with keys alfa, beta, gamma and delta, and s1['gamma'] = 3 indicates that the term gamma appears three times in d1, as shown in Fig. (1).

B. LSAInfoKB data type
By grouping the LSAVectors that represent a set of documents, we obtain a KB. It is characterized by a vocabulary formed from the union of all keys of the LSAVectors involved. For example, suppose s2 is an LSAVector associated with a document d2 containing only the terms alfa and gamma (so s2['beta'] = s2['delta'] = 0); then s1 and s2 form a KB with a vocabulary made up of the terms alfa, beta, gamma and delta. If we add an LSAVector s3 for a document d3 that contains the words alfa, beta, gamma and iota, the KB resulting from the combination has a vocabulary made up of alfa, beta, gamma, delta and iota.
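A minimal Python model of the LSAVector map and the KB vocabulary union described above; the document contents are hypothetical, chosen only to match the frequencies in the example (e.g. gamma occurring three times in d1):

```python
from collections import Counter

# Hypothetical document contents consistent with the example in the text:
# d1 contains alfa, beta, gamma (three times) and delta; d2 only alfa and
# gamma; d3 contains alfa, beta, gamma and iota.
d1 = "alfa beta gamma gamma gamma delta"
d2 = "alfa gamma"
d3 = "alfa beta gamma iota"

# An LSAVector is essentially an ordered term -> frequency map.
s1, s2, s3 = (Counter(d.split()) for d in (d1, d2, d3))
assert s1["gamma"] == 3   # gamma appears three times in d1
assert s2["beta"] == 0    # terms absent from d2 have frequency 0

# The KB vocabulary is the union of all keys of the LSAVectors involved.
vocabulary = sorted(set(s1) | set(s2) | set(s3))
# vocabulary == ['alfa', 'beta', 'delta', 'gamma', 'iota']
```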
Each LSAVector is itself a KB, but we want to keep these two types of data separate. A KB is a data type that holds vocabulary information and a frequency matrix, named the term-document matrix. It is designed as an array of arrays in which the frequency of each term's appearance in each document is stored. This means that if, for example, we create a KB by grouping the LSAVectors s1, s2 and s3 associated with the documents d1, d2 and d3, we obtain a KB K with a vocabulary and a frequency matrix as presented in Fig. (2). In order to proceed with the latent semantic analysis we need the U, S and V matrices obtained from the SVD decomposition of the term-document matrix. The amount of information in the LSAInfoKB data type would be too expensive in terms of memory allocation if every LSAVector were treated as a KB with only one document. Based on these considerations, we created two different data types: one containing only the information necessary for a single document, and the other also containing the additional information related to the whole collection of documents, namely the matrices of the SVD decomposition. For example, for a KB made up of our documents d1, d2 and d3, we obtain an LSAInfoKB value represented as shown in Fig. (3).

Implemented functions for latent semantic indexing
In order to perform LSA, we need specific functions for specific tasks. Although electronic documents can exist in different formats, such as doc, pdf, ppt or html, we are only interested in the textual information, i.e. the words stored in the files. For this reason we decided to operate on simple text fields. In this section we introduce the four newly implemented functions for LSA applications.

A. LSAVector tolsavector(text doc)
First of all, we need a function able to translate a simple text field into an LSAVector, the data type introduced above.
The number of components of this vector is equal to the number of unique terms present in the document; each component is associated with an integer number that represents the frequency of the term's appearance in the document, as shown in Fig. (4).

B. LSAVectorConcat and tolsavector overloading
When we group a large number of documents, and hence many LSAVectors, we obtain a KB. This KB provides the related vocabulary, that is, the set of words present at least once in at least one document of the KB, which enables the LSA analysis. An LSAVector may have many representations depending on the chosen vocabulary. We therefore overload the tolsavector function with a new argument containing the vocabulary associated with the KB under consideration. The vocabulary argument is essentially another LSAVector generated by the KB, since all the information related to the vocabulary is already contained within its structure.
LSAVector tolsavector(text simpletext, LSAVector vocabulary): this function builds an LSAVector from the terms in the simple text.
LSAVector tolsavectorconcat(LSAVector v, LSAVector vocabulary): this function builds a new LSAVector restricted to the terms that are in the vocabulary.

C. LSAInfoKB tolsainfo(LSAVector v) and LSAInfoKB tolsaikb(LSAVector v)
In order to perform latent semantic analysis, we need information that must be combined with the KB through the application of a Singular Value Decomposition of the term-document matrix. The term-document matrix is the result of grouping LSAVector elements based on the same vocabulary. In order to compute this information we created an aggregator function. An aggregator function in SQL has the same data type for its argument and for its result. This means that, to obtain an LSAInfoKB as the result of the aggregator function, we need to aggregate LSAInfoKB values. For this reason we need a function to convert an LSAVector into an LSAInfoKB: this function is LSAInfoKB tolsaikb(LSAVector v). Now we can use the LSAInfoKB aggregator with an LSAVector argument, cast beforehand using the tolsaikb function.
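An SQL aggregate combines a state-transition function, applied once per input row, with a final function applied to the accumulated state. The aggregator just described can be sketched in Python as a fold; this is a simplified model with hypothetical helper names (the real extension operates on LSAInfoKB values in C, and its final step would additionally compute the SVD):

```python
from collections import Counter
from functools import reduce

def transition(state, vector):
    """State-transition step: accumulate one per-document
    term -> frequency map into the running state."""
    return state + [vector]

def final(state):
    """Final step: merge the vocabularies and build the
    term-document matrix from the accumulated vectors."""
    vocab = sorted(set().union(*state))
    matrix = [[v[t] for v in state] for t in vocab]
    return vocab, matrix

# Two hypothetical per-document vectors standing in for table rows.
rows = [Counter("alfa gamma".split()), Counter("beta gamma".split())]
vocab, matrix = final(reduce(transition, rows, []))
# vocab == ['alfa', 'beta', 'gamma']; matrix == [[1, 0], [0, 1], [1, 1]]
```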
The general operation of this aggregator is simply the LSAVectorConcat function, while the final operation calculates the three matrices needed for the LSA analysis. The U, S and V matrices allow us to make similarity estimations in SQL statements.

D. double simcos(LSAVector di, LSAVector q, LSAInfoKB kb)
One important operation in LSA is the estimation of the similarity of a document di, belonging to the KB, with respect to a given query q. To perform this operation we created the simcos function. This function takes the LSAVectors q and di and translates them into the concept space using the following equations:

q̂ = S⁻¹ Uᵀ q
d̂ = S⁻¹ Uᵀ d

The simcos function uses the S and U matrices stored in the KB value (its third argument). The last step to obtain the actual similarity value is to compute the cosine between the two projected vectors:

simcos = (q̂ · d̂) / (‖q̂‖ ‖d̂‖)

In the same way we can estimate the similarity of a document that does not belong to the KB. It is important to point out that such a document might contain terms that are not present in the KB, so we can only estimate the portion of similarity due to the terms (and concepts) appearing in our KB.

Power of LSA use in an SQL environment: an example
Here we give an example with some applications showing the simplicity of using the SQL extension proposed in this research. Considering the possibility of using this implementation to index and search, via natural language, documents whose source could be a web site, an ftp server, a filesystem or a samba server, we suppose that a crawler takes .doc, .ppt, .pdf and .html files from these sources, extracts the textual information from the structure of the documents and puts all the terms into a simple text field in the database. Moreover, we suppose we have a simple table, as shown in Fig. (5):
ID: a numeric value used as the primary key
corpus: the result of the textual information extraction performed by the crawler while fetching documents
filename: stores the name and path of the document
filetype: stores the type of the document fetched by the crawler
sourcename: stores the name of the source from which the document was fetched by the crawler.

This simple table offers several ways to search the documents. Naturally, we can perform a full-text search on the corpus field or the filename field, and we can apply a filter on the file type. But we can also search based on semantic correlations. We can use different groups of documents to build our KBs; there is no reason to use all the records in the documents table. By using LSA in an SQL environment we gain flexibility in our analysis: we can consider different subsets of documents as a KB by specifying a WHERE condition in our statement. This means that for every considered subset of documents we can estimate the similarity between a query and one or more documents. For example, we can consider a KB made only of the documents present on the myusbpen resource, using an SQL query like:

SELECT LSAInfoKB(toLSAInfo(toLSAVector(corpus)))
FROM documents
WHERE sourcename = 'myusbpen';

Having chosen a KB, we can store it in a table for further similarity querying, or we can issue a direct similarity query
by using a nested query. Suppose we have a query captured as a simple text field from a client application connected to our database. A query is, in effect, exactly a document: it can be converted into an LSAVector and used for similarity estimation. In order to perform a similarity query without a materialized KB, we can submit a query like this to our database:

SELECT simcos(toLSAVector(corpus), KB.kb)
FROM documents,
     (SELECT LSAInfoMATRIX(toLSAVector(corpus)) AS kb FROM docs) AS KB;

Obviously, without a stored KB, this query is more time-expensive than one based on a materialized KB. In other words, if we store our KB in a table defined like:

CREATE TABLE KB AS
SELECT LSAInfoMATRIX(toLSAInfo(toLSAVector(corpus))) AS kb
FROM documents;

we can evaluate the similarity level with a query like this:

SELECT simcos(toLSAVector(corpus), KB.kb)
FROM documents, KB;

To appreciate the potential of the SQL-LSA fusion, consider the following example. Suppose we have found a natural-language query that satisfies a specific knowledge need for our work. We can keep the matching documents constantly up to date simply by using a view based on a semantic query:

CREATE VIEW mypersonalknowledgecollectiononelectric AS
SELECT simcos(toLSAVector(corpus), KB.kb)
FROM documents,
     (SELECT LSAInfoMATRIX(toLSAVector(corpus)) AS kb FROM docs) AS KB;

The semantic similarity obtained from the LSA analysis can also be used within SQL commands such as DELETE or INSERT, including a WHERE clause, in order to manage our KB. Suppose we want to prune our KB, deleting or moving the documents whose level of semantic similarity with respect to a specific query is less than 0.2.
This task can be done using the following SQL statement:

DELETE FROM documents
WHERE id IN (
    SELECT id
    FROM documents,
         (SELECT LSAInfoMATRIX(toLSAVector(corpus)) AS kb FROM docs) AS KB
    WHERE simcos(toLSAVector(corpus), KB.kb) < 0.2
);

Notice that in the query above, the deletion is based on an LSA analysis executed on the fly. The capabilities shown above are possible because our implementation acts as an extension of the SQL language.

Future work
Scalability is one of the most important issues in software design: the system's performance must not degrade significantly as the database grows. Within the scope of this research, further work will address the reliability and efficiency of the implemented SQL extension on real, very large databases of textual information, hopefully while maintaining overall system performance.

Conclusion
In this research we proposed the design and implementation of an SQL extension for latent semantic analysis applications. We introduced the data types and functions used to provide the basic LSA operations. Furthermore, some examples were given to show the simplicity of using the extension in SQL queries.

References
[1] McHugh J., Abiteboul S., Goldman R., Quass D., Widom J. (1997) Lore: A database management system for semistructured data. Technical report, Stanford University.
[2] Manber U., Wu S. (1993) GLIMPSE: A tool to search through entire file systems. Technical Report TR 93-34, Department of Computer Science, University of Arizona, Tucson, Arizona.
[3] Paris L.A.H., Tibbo H.R. (1998) Information Processing and Management, 34(23).
[4] Belkin N.J., Croft W.B. (1987) Annual Review of Information Science and Technology, 22(9).
[5] Furnas G.W., Deerwester S., Dumais S.T., Landauer T.K., Harshman R.A., Streeter L.A., Lochbaum K.E. (1988) Information retrieval using a singular value decomposition model of latent semantic structure. International Conference on Research and Development in IR, New York.
[6] Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K., Harshman R.A. (1990) Journal of the American Society for Information Science, 41.
[7] Gallant S.I. (1991) Neural Computation, 3.
[8] Schutze H. (1992) Dimensions of meaning. Proceedings of Supercomputing.
[9] Landauer T.K., Littman M.L. (1990) Fully automatic cross-language document retrieval using latent semantic indexing. Proceedings of the Sixth Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research, Waterloo, Ontario.
[10] Young P.G. (1994) Cross-Language Information Retrieval Using Latent Semantic Indexing. Master's thesis, University of Tennessee, Knoxville, TN.
[11] Ceglowski M., Coburn A., Cuadrado J. (2003) Proceedings of the IEEE/WIC International Conference on Web Intelligence (WI'03).
[12] Kumaran A., Haritsa J.R. (2005) SemEQUAL: Multilingual Semantic Matching in Relational Systems. DASFAA.
[13] Rosario B. (2000) Latent Semantic Indexing: An overview. INFOSYS 240, Spring 2000.
[14] Kontostathis A. (2007) International Conference on System Sciences, HICSS.
Fig. 1- d1 document and related LSAVector s1
Fig. 2- Vocabulary and term-document matrix for the KB made by the s1, s2 and s3 LSAVectors
Fig. 3- Term-document matrix and the U, S and V matrices for the Knowledge Base made by the s1, s2 and s3 LSAVectors
Fig. 4- The tolsavector function applied to a simple text document
Fig. 5- Table format
Techniques For Optimizing The Relationship Between Data Storage Space And Data Retrieval Time For Large Databases TECHNIQUES FOR OPTIMIZING THE RELATIONSHIP BETWEEN DATA STORAGE SPACE AND DATA RETRIEVAL
Exam in course TDT4215 Web Intelligence - Solutions and guidelines -
English Student no:... Page 1 of 12 Contact during the exam: Geir Solskinnsbakk Phone: 94218 Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Friday May 21, 2010 Time: 0900-1300 Allowed
Spam Filtering Based on Latent Semantic Indexing
Spam Filtering Based on Latent Semantic Indexing Wilfried N. Gansterer Andreas G. K. Janecek Robert Neumayer Abstract In this paper, a study on the classification performance of a vector space model (VSM)
Foundations of Business Intelligence: Databases and Information Management
Foundations of Business Intelligence: Databases and Information Management Problem: HP s numerous systems unable to deliver the information needed for a complete picture of business operations, lack of
1 o Semestre 2007/2008
Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction
1 File Processing Systems
COMP 378 Database Systems Notes for Chapter 1 of Database System Concepts Introduction A database management system (DBMS) is a collection of data and an integrated set of programs that access that data.
ProteinQuest user guide
ProteinQuest user guide 1. Introduction... 3 1.1 With ProteinQuest you can... 3 1.2 ProteinQuest basic version 4 1.3 ProteinQuest extended version... 5 2. ProteinQuest dictionaries... 6 3. Directions for
Module 1: Getting Started with Databases and Transact-SQL in SQL Server 2008
Course 2778A: Writing Queries Using Microsoft SQL Server 2008 Transact-SQL About this Course This 3-day instructor led course provides students with the technical skills required to write basic Transact-
SPATIAL DATA CLASSIFICATION AND DATA MINING
, pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal
Physical Database Design Process. Physical Database Design Process. Major Inputs to Physical Database. Components of Physical Database Design
Physical Database Design Process Physical Database Design Process The last stage of the database design process. A process of mapping the logical database structure developed in previous stages into internal
IT2304: Database Systems 1 (DBS 1)
: Database Systems 1 (DBS 1) (Compulsory) 1. OUTLINE OF SYLLABUS Topic Minimum number of hours Introduction to DBMS 07 Relational Data Model 03 Data manipulation using Relational Algebra 06 Data manipulation
Writing Queries Using Microsoft SQL Server 2008 Transact-SQL
Course 2778A: Writing Queries Using Microsoft SQL Server 2008 Transact-SQL Length: 3 Days Language(s): English Audience(s): IT Professionals Level: 200 Technology: Microsoft SQL Server 2008 Type: Course
A Statistical Text Mining Method for Patent Analysis
A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, [email protected] Abstract Most text data from diverse document databases are unsuitable for analytical
Technical Report. The KNIME Text Processing Feature:
Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold [email protected] [email protected] Copyright 2012 by KNIME.com AG
Database Systems. Lecture 1: Introduction
Database Systems Lecture 1: Introduction General Information Professor: Leonid Libkin Contact: [email protected] Lectures: Tuesday, 11:10am 1 pm, AT LT4 Website: http://homepages.inf.ed.ac.uk/libkin/teach/dbs09/index.html
Chapter 6 FOUNDATIONS OF BUSINESS INTELLIGENCE: DATABASES AND INFORMATION MANAGEMENT Learning Objectives
Chapter 6 FOUNDATIONS OF BUSINESS INTELLIGENCE: DATABASES AND INFORMATION MANAGEMENT Learning Objectives Describe how the problems of managing data resources in a traditional file environment are solved
Data, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
Tax Fraud in Increasing
Preventing Fraud with Through Analytics Satya Bhamidipati Data Scientist Business Analytics Product Group Copyright 2014 Oracle and/or its affiliates. All rights reserved. 2 Tax Fraud in Increasing 27%
Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data
INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are
Data Warehouse Snowflake Design and Performance Considerations in Business Analytics
Journal of Advances in Information Technology Vol. 6, No. 4, November 2015 Data Warehouse Snowflake Design and Performance Considerations in Business Analytics Jiangping Wang and Janet L. Kourik Walker
Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis
IOSR Journal of Computer Engineering (IOSRJCE) ISSN: 2278-0661, ISBN: 2278-8727 Volume 6, Issue 5 (Nov. - Dec. 2012), PP 36-41 Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis
Analyze Database Optimization Techniques
IJCSNS International Journal of Computer Science and Network Security, VOL.10 No.8, August 2010 275 Analyze Database Optimization Techniques Syedur Rahman 1, A. M. Ahsan Feroz 2, Md. Kamruzzaman 3 and
Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words
, pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan
BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE
BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE Alex Lin Senior Architect Intelligent Mining [email protected] Outline Predictive modeling methodology k-nearest Neighbor
Indexing Techniques for Data Warehouses Queries. Abstract
Indexing Techniques for Data Warehouses Queries Sirirut Vanichayobon Le Gruenwald The University of Oklahoma School of Computer Science Norman, OK, 739 [email protected] [email protected] Abstract Recently,
Big Data: Rethinking Text Visualization
Big Data: Rethinking Text Visualization Dr. Anton Heijs [email protected] Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important
In-memory databases and innovations in Business Intelligence
Database Systems Journal vol. VI, no. 1/2015 59 In-memory databases and innovations in Business Intelligence Ruxandra BĂBEANU, Marian CIOBANU University of Economic Studies, Bucharest, Romania [email protected],
Mining Text Data: An Introduction
Bölüm 10. Metin ve WEB Madenciliği http://ceng.gazi.edu.tr/~ozdemir Mining Text Data: An Introduction Data Mining / Knowledge Discovery Structured Data Multimedia Free Text Hypertext HomeLoan ( Frank Rizzo
Folksonomies versus Automatic Keyword Extraction: An Empirical Study
Folksonomies versus Automatic Keyword Extraction: An Empirical Study Hend S. Al-Khalifa and Hugh C. Davis Learning Technology Research Group, ECS, University of Southampton, Southampton, SO17 1BJ, UK {hsak04r/hcd}@ecs.soton.ac.uk
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents
Wiley. Automated Data Collection with R. Text Mining. A Practical Guide to Web Scraping and
Automated Data Collection with R A Practical Guide to Web Scraping and Text Mining Simon Munzert Department of Politics and Public Administration, Germany Christian Rubba University ofkonstanz, Department
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Writing Queries Using Microsoft SQL Server 2008 Transact-SQL
Lincoln Land Community College Capital City Training Center 130 West Mason Springfield, IL 62702 217-782-7436 www.llcc.edu/cctc Writing Queries Using Microsoft SQL Server 2008 Transact-SQL Course 2778-08;
Foundations of Business Intelligence: Databases and Information Management
Chapter 5 Foundations of Business Intelligence: Databases and Information Management 5.1 Copyright 2011 Pearson Education, Inc. Student Learning Objectives How does a relational database organize data,
A database can simply be defined as a structured set of data
Database Management Systems A database can simply be defined as a structured set of data that is any collection of data stored in mass storage that can serve as the data source for a variety of applications
Search Engines. Stephen Shaw <[email protected]> 18th of February, 2014. Netsoc
Search Engines Stephen Shaw Netsoc 18th of February, 2014 Me M.Sc. Artificial Intelligence, University of Edinburgh Would recommend B.A. (Mod.) Computer Science, Linguistics, French,
Enabling Semantic Search in Geospatial Metadata Catalogue to Support Polar Sciences
Enabling Semantic Search in Geospatial Metadata Catalogue to Support Polar Sciences Wenwen Li 1 and Vidit Bhatia 2 1 GeoDa Center for Geospatial Analysis and Computation School of Geographical Sciences
TRENDS IN THE DEVELOPMENT OF BUSINESS INTELLIGENCE SYSTEMS
9 8 TRENDS IN THE DEVELOPMENT OF BUSINESS INTELLIGENCE SYSTEMS Assist. Prof. Latinka Todoranova Econ Lit C 810 Information technology is a highly dynamic field of research. As part of it, business intelligence
Topics in basic DBMS course
Topics in basic DBMS course Database design Transaction processing Relational query languages (SQL), calculus, and algebra DBMS APIs Database tuning (physical database design) Basic query processing (ch
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive
