Full-text Search in Intermediate Data Storage of FCART

Similar documents
About Universality and Flexibility of FCA-based Software Tools

FCART: A New FCA-based System for Data Analysis and Knowledge Discovery

Information Retrieval and. Knowledge Discovery with FCART

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Planning the Installation and Installing SQL Server

An Approach to Implement Map Reduce with NoSQL Databases

A Performance Analysis of Distributed Indexing using Terrier

Processing Large Amounts of Images on Hadoop with OpenCV

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

XTM Web 2.0 Enterprise Architecture Hardware Implementation Guidelines. A.Zydroń 18 April Page 1 of 12

Search and Real-Time Analytics on Big Data

Big Data Mining Services and Knowledge Discovery Applications on Clouds

The Open Source Knowledge Discovery and Document Analysis Platform

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

Database Marketing, Business Intelligence and Knowledge Discovery

A CLOUD-BASED FRAMEWORK FOR ONLINE MANAGEMENT OF MASSIVE BIMS USING HADOOP AND WEBGL

Licenze Microsoft SQL Server 2005

DYNAMIC QUERY FORMS WITH NoSQL

EXTENDING JMETER TO ALLOW FOR WEB STRUCTURE MINING

Pattern Insight Clone Detection

Development of Monitoring and Analysis Tools for the Huawei Cloud Storage

Log management with Logstash and Elasticsearch. Matteo Dessalvi

Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack

IT services for analyses of various data samples

Hadoop Ecosystem B Y R A H I M A.

A Comparative Study on Sentiment Classification and Ranking on Product Reviews

Search and Information Retrieval

Knowledge Discovery from patents using KMX Text Analytics

Inmagic Content Server Standard and Enterprise Configurations Technical Guidelines

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Designing Applications with Distributed Databases in a Hybrid Cloud

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

BRINGING INFORMATION RETRIEVAL BACK TO DATABASE MANAGEMENT SYSTEMS

Embedded inside the database. No need for Hadoop or customcode. True real-time analytics done per transaction and in aggregate. On-the-fly linking IP

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

World-wide online monitoring interface of the ATLAS experiment

Things Made Easy: One Click CMS Integration with Solr & Drupal

Katta & Hadoop. Katta - Distributed Lucene Index in Production. Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com

IBM Rational Asset Manager

Hadoop-based Open Source ediscovery: FreeEed. (Easy as popcorn)

On a Hadoop-based Analytics Service System

Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.

Data Mining with Hadoop at TACC

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

Search and Data Mining: Techniques. Introduction Anna Yarygina Boris Novikov

SQL Server 2005 Features Comparison

Information Retrieval Elasticsearch

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

ICONICS Choosing the Correct Edition of MS SQL Server

RevoScaleR Speed and Scalability

Hadoop Technology for Flow Analysis of the Internet Traffic

Microsoft Dynamics CRM 2011 Guide to features and requirements

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

Handling Big(ger) Logs: Connecting ProM 6 to Apache Hadoop

Graph Database Performance: An Oracle Perspective

PROPOSAL To Develop an Enterprise Scale Disease Modeling Web Portal For Ascel Bio Updated March 2015

MicroStrategy Course Catalog

How To Make Sense Of Data With Altilia

High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP

Capacity Planning for Microsoft SharePoint Technologies

Machine Learning for Big Data Texts, Signals, Images and Video

Investigating Hadoop for Large Spatiotemporal Processing Tasks

Semantic Stored Procedures Programming Environment and performance analysis

Drivers to support the growing business data demand for Performance Management solutions and BI Analytics

Performance and Scalability Overview

BIG DATA ANALYTICS For REAL TIME SYSTEM

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Safe Harbor Statement

Prognoz Payment System Data Analysis. Description of the solution

European Archival Records and Knowledge Preservation Database Archiving in the E-ARK Project

System Requirements Table of contents

Sisense. Product Highlights.

Log Mining Based on Hadoop s Map and Reduce Technique

What s Cooking in KNIME

Intinno: A Web Integrated Digital Library and Learning Content Management System

Pentaho High-Performance Big Data Reference Configurations using Cisco Unified Computing System

An Ontology Based Text Analytics on Social Media

ANALYTICS IN BIG DATA ERA

The New ADS Search Interface and API

Océ PRISMA archive software. Archiving made easy. Powerful, high-volume. archiving software

Big Systems, Big Data

The Ultimate Guide to Buying Business Analytics

Enabling the Big Data Commons through indexing of data and their interactions

Discovering Business Insights in Big Data Using SQL-MapReduce

Dependency Free Distributed Database Caching for Web Applications and Web Services

GenomeSpace Architecture

Understanding the Benefits of IBM SPSS Statistics Server

NoSQL Roadshow Berlin Kai Spichale

An approach to grid scheduling by using Condor-G Matchmaking mechanism

Effective User Navigation in Dynamic Website

RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS

Preprocessing Web Logs for Web Intrusion Detection

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

Analyzing large flow data sets using. visualization tools. modern open-source data search and. FloCon Max Putas

Client Overview. Engagement Situation. Key Requirements

APPLYING CASE BASED REASONING IN AGILE SOFTWARE DEVELOPMENT

Transcription:

Full-text Search in Intermediate Data Storage of FCART Alexey Neznanov, Andrey Parinov National Research University Higher School of Economics, 20 Myasnitskaya Ulitsa, Moscow, 101000, Russia ANeznanov@hse.ru, AParinov@hse.ru Abstract. The speed of full-text search directly affects the process of text analysis. Search engine creates a text index, which is used for fast full-text search. Solr and ElasticSearch are two popular search engines. A text analysis system requires fast implementing searching and indexing at the same time. This paper describes preprocessing workflow of the analysis system called Formal Concept Analysis Research Toolbox (FCART) and experiment of searching and indexing social networking service data at the same time. Results of the experiment show which search engine is better as the core of FCART search subsystem. Keywords: Formal Concept Analysis, Knowledge Extraction, Data Mining, Software, Big Data, Social Network Analysis. 1 Introduction Formal Concept Analysis Research Toolbox (FCART)[1] is a data analysis system, which supports texts knowledge discovery techniques, including those based on Formal Concept Analysis[2], clustering [3], multimodal clustering [4, 5], pattern structures[6]. The system supports iterative methodology of knowledge discovery. The goal of developing FCART is to create a system for handy texts analysis from external data sources, e.g. SQL databases, NoSql databases and Social Network Services. In previous papers, we have described the system architecture, main workflow and stages of data extraction from various external sources. Fast search of relevant documents in a big dataset is an important function of a text analysis system. Development of search engine from scratch is time-consuming and unnecessary since many search engines are known so far, Solr [7] and ElasticSearch [8] being among most popular ones [9]. In this paper we describe key features of search engines mentioned above, describe preprocessing workflow of FCART, requirements to the full-text search engine as a part of an analytic system and then describe the experiment of searching and indexing at the same time of data that was gathered from social network service LiveJournal [10]. Based on the results of the experiment, we choose a search engine that would better serve as the core of FCART full-text search subsystem.

2 FCART architecture and preprocessing workflow Formal Concept Analysis (FCA) has many applications in different fields [11,12]. Formal Concept Analysis Research Toolbox (FCART) is an integrated environment for knowledge and data engineers with a set of research tools based on Formal Concept Analysis. FCART is built as a distributed web-based system with a thick client. In new distributed version FCART consists of four parts: 1. FCART Intermediate Data Storage (IDS) for storage and preprocessing (initial converting, data preprocessing, etc.) of big datasets; 2. FCART Web-based solvers (Web-Solvers) for independent resource-intensive computations; 3. FCART Auth Server for authentication and authorization; 4. FCART Thick Client for interactive data processing and visualization in the integrated environment. From an analyst point of view, basic FCA workflow in FCART has five stages (see Fig. 1). 1. Filling Intermediate Data Storage of FCART from various external SQL, XML or JSON-like data sources (querying external source is described by External Data Query Description (EDQD); 2. Data indexing and preprocessing; 3. Loading a data snapshot from a local storage to the current analytic session (snapshot described by Snapshot Profile). Data snapshot is a data table with annotated structures and text attributes, loaded in the system by accessing external data sources; 4. Transforming the snapshot to a binary context (transformation described by Scaling Query); 5. Building and visualizing concept lattices and other artifacts arising from formal contexts (binary data tables) within an analytic session. 72

External Data Source FCART IDS Import/Export Tools FCART Client Import/Export Tools FCART Intermediate Data Storage Local Session Data Base Analytic Artifacts Pattern Structures Clusters Other artifacts External Data Set JSON Collection Data Snapshot (Multivalued Context) Binary Context Concept Lattice External Data Query Description Snapshot Profile Scaling Query + Graph Generators Fig. 1. Main FCA workflow of FCART All steps of data analysis are accomplished with the use of Client program. In the background, Client interacts with Server using Server REST interface commands. In the previous articles, we described Server REST interface in detail. In the current version of FCART the preprocessing workflow consists of two steps: 1. Normalizing incoming documents 2. Creating index At normalizing step, FCART creates a document in JSON format, selects one text field as identifier and calculates document metrics (size, word count, etc.). At index creating step, FCART creates index on one or multiple text fields. This index is used for fast data search. 3 Comparison of popular full-text libraries and systems Full-text indexing library adds content to a full-text index. It then allows performing queries on this index, returning results either ranked by the relevance to the query or sorted by an arbitrary field such as document's last modified date. The library gets fast search responses because -- instead of searching the text directly -- it searches the index. The library cannot be used independently, it can be used as a database plugin or as a part of another software system. There are several key features of the full-text search library: 1. Search, initial indexing and reindexing speed; 2. Support of languages and dialects; 3. List of supported programming languages; 4. Format of supported documents; 73

5. Stemming algorithms; 6. List of metrics and ranking methods; The full-text indexing system is a standalone, independent product. It is based on fulltext indexing library and provides more features. For example, Solr and ElasticSearch are based on same library called Lucene[13]. Besides the features mentioned above, the full-text system provides the following ones: 1. Support of multiple indexes based on different fields 2. Scalability 3. Support for complex search expressions 4. Ranking and grouping of search results 5. Ease of use and ease of integration with data storage (e.g. MongoDb) The final value of each feature of a particular software system depends on CPU, number of cores, total amount of memory, etc. Solr and ElasticSearch provide very similar sets of features [14]. These sets of features are enough to satisfy our current needs. We deploy ElasticSearch and Solr on our server and execute similar experiment. In the next section we describe an experiment on indexing data gathered from the social network service Liverjournal. 4 Indexing data description LiveJournal is a popular in Russia social blogging system. Using LiveJournal Web API we have extracted 1000,100 000, 1000000 person s profiles. Then we carried out some classical SNA experiments and tested improved subsystems of FCART. Initially, IDS takes five initial profiles and initiates process of downloading the neighbourhood of these persons by relation has a friend. Each profile consists of several blocks of fields : 3. Nickname ( nick unique identifier). 4. List of friends ( friends array of nicknames). 5. List of users who checked this profile as a friend ( friendsof array of nicknames). 6. List of interests ( interest array of tags). 7. List of watching communities ( watching array of community names). 8. List of memberships in communities ( memberof array of community names). 9. List of posts (not used in next stages). 10. Other personal information. Table 1 contains time results of gathering data. Figure 2 illustrates results of gathering data. Number of parallel Time for gathering, minutes 1000 10000 100000 1000000 74

threads Profiles Profiles Profiles Profiles 1 18 195 3671 42710 10 2 21 291 4483 100 0.2 3 32 471 Table 1. Time for extracting data from LiveJournal Fig. 2. Time for extracting data from LiveJournal 5 Experiment of searching and indexing data Indexing system Solr 5.4 and ElasticSearch 2.0 have been deployed on a computer with installed INTEL Xeon Processor E5-2670 (2,60 GHz, 4 cores) 8Gb of RAM [15]. We have compared time of search the systems spend on indexing data. Table 2 contains results of the experiment. Fig 2 illustrates results of the experiment. Experiment shows advantage ElasticSearch over Solr from 3 to 7 times. Time for indexing, milliseconds 1000 10000 100000 1000000 profiles profiles profiles profiles Solr 3000 5700 6200 6900 ElasticSearch 5 16 29 42 Table 2. Solr and ElasticSearch time of searching while performing indexing 75

Fig. 3. Solr and ElasticSearch time of searching while performing indexing Conclusion In this paper, we have described key features of two popular full-text search engines, architecture and preprocessing workflow of FCART software. The case of searching data and indexing at the same time was considered. Our experiment have shown significant advantage ElasticSearch over Solr. In future work we will use ElasticSearch as the core of the search subsystem. Next FCART release will be integrated with ConceptCloud[16] and expanded by the function of indexing texts using parsed thickets index[17] Acknowledgements This work was carried out by the authors within the project Data mining based on applied ontologies and lattices of closed descriptions supported by the Basic Research Program of the National Research University Higher School of Economics. References 1. Neznanov, A., Ilvovsky, D., Parinov, A. Advancing FCA Workflow in FCART System for Knowledge Discovery in Quantitative Data // 2nd International Conference on Information Technology and Quantitative Management (ITQM-2014), Procedia Computer Science, 31, 2014, pp. 201-210. 2. Ganter, B., Wille R. Formal Concept Analysis: Mathematical Foundations, Springer, 1999. 76

3. Neznanov A.A., Ilvovsky D.A., Kuznetsov S.O. FCART: A New FCA-based System for Data Analysis and Knowledge Discovery, Contributions to the 11th International Conference on Formal Concept Analysis, 2013. pp. 31-44. 4. Mirkin, B. Mathematical Classification and Clustering, Springer, 1996. 5. Ignatov, D.I., Kuznetsov, S.O., Magizov, R.A., Zhukov, L.E. From Triconcepts to Triclusters. Proc. of 13th International Conference on rough sets, fuzzy sets, data mining and granular computing (RSFDGrC-2011), LNCS/LNAI Volume 6743/2011, Springer (2011), pp. 257-264. 6. Kuznetsov, S.O. Pattern Structures for Analyzing Complex Data // Proc. of 12th International conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC-2009), 2009, pp. 33-44. 7. Apache Solr (http://lucene.apache.org/solr/) 8. ElasticSearch (https://www.elastic.co/) 9. Ranking of Search Engines (http://db-engines.com/en/ranking/search+engine) 10. LiveJournal (http://livejournal.com) 11. Jonas Poelmans, Sergei O. Kuznetsov, Dmitry I. Ignatov, Guido Dedene, Formal Concept Analysis in knowledge processing: A survey on models and techniques. In: Expert Systems with Applications, Vol. 40. No. 16, pp. 6601-6623, 2013. 12. Poelmans J., Ignatov, D.I., Kuznetsov, S.O., Dedene, G. Formal concept analysis in knowledge processing: A survey on applications. Expert Systems with Applications, 40, 2013, pp. 6538-6560. 13. Apache Lucene (https://lucene.apache.org/) 14. Apache Solr vs Elasticsearch (http://solr-vs-elasticsearch.com/) 15. Zeus (http://zeus.hse.ru:8080) 16. Gillian J. Greene and Bernd Fischer. 2014. ConceptCloud: a tagcloud browser for software archives. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014). ACM, New York, NY, USA, 759-762. 17. Galitsky B., Ilvovsky D., Kuznetsov S. Text integrity assessment: Sentiment profile vs rhetoric structure, in: Computational Linguistics and Intelligent Text Processing. 16th International Conference, CICLing 2015, Cairo, Egypt, April 14-20, 2015, Proceedings, Part II. Vol. 9042. Springer International Publishing, 2015. P. 126-139. 77