IT services for analyses of various data samples




Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky

Technical University of Košice, Faculty of Electrical Engineering and Informatics, Department of Cybernetics and Artificial Intelligence, Letná 9/B, 042 00 Košice, Slovakia
{jan.paralic, frantisek.babic, martin.sarnovsky, peter.butka, cecilia.havrilova, miroslava.muchova, michal.puheim, martin.mikula, gabriel.tutoky}@tuke.sk

Abstract. Efficient processing and analysis of various data samples is becoming an important means of gaining a competitive advantage on the market. In this situation, analytical services available through the cloud are an interesting way to offer a variety of methods and algorithms in an easy-to-use form. The set of services described in this paper was designed and implemented on the basis of multi-year research and project activities in domains such as text mining, distributed and parallel computing, sentiment analysis, topic modelling and data science in general. We present three main subsets of services with a short description of the approaches and technologies used. The implemented methods and algorithms have been continuously tested and deployed within previous national and EU projects, dissertations, master's theses, etc.

Keywords: data, analysis, services

1 Introduction

An important condition for the proper functioning and efficient performance of the presented services is a technical infrastructure that provides the necessary computing power and data capacity. We are continuously building our own computing environment, in which we can not only deploy and test our services but also offer them as SaaS (Software as a Service) to other potential users. Some basic characteristics of the proposed architecture are the following:

- Private cloud managed with CoreOS, a lightweight Linux distribution that focuses on managing Linux containers.
- Web services decoupled into containers using Docker, software for automating the deployment of applications into Linux containers. The OS and the application are combined into a software container, which can then be launched inside virtualization software.
- REST-like web services with a dedicated web portal for user interaction. Programmatic calls are handled through the REST-like API, the current de facto standard for web services, which provides a simpler alternative to the SOAP and WS-* standards. User interactions are conducted through a web application that itself uses the same web services in the background.
- Programming languages such as Java, C#, Python and R for business logic and analytics.

ISBN 978-80-553-2271-1

2 Analytical platform

This section describes each subset of services in more detail:

- Effective management, storage and analysis of large collections of text documents using a sufficiently powerful computing platform built on a private cloud infrastructure.
- Analysis of transaction data from electronic shops in order to provide recommendations for customers based on their buying behavior and/or the buying behavior of previous customers with similar characteristics.
- Analysis of textual data, e.g. data from web discussions, to identify overall customer satisfaction with given products, or to identify the major topics occurring in a given collection of textual data as well as the sentiment of their authors about particular topics.

2.1 Big text data analysis

Services for big text data analysis are designed and implemented in line with current state-of-the-art technologies and frameworks, including newly designed and implemented methods and algorithms, and are accessible through a web portal or an API [8]. The backend implementation is based on JBOWL, an internally developed Java library for text mining tasks and supporting services (i.e. indexing and preprocessing) [5]. Selected methods were re-implemented in distributed versions using the GridGain API [3, 4]. GridGain is a framework for developing distributed applications, including real-time big data analytical applications. The user interface is implemented using JSP (JavaServer Pages), and interactive visualizations of the models are implemented in the Processing framework. We offer the following:

- Services for management and manipulation of text document collections: services for dataset manipulation, including dataset management.
- Services for indexing, complex statistical text analyses and preprocessing tasks: services for preprocessing of text documents, including methods such as stop-word removal, stemming and several term-weighting schemes.
- Services for building classification models, implemented in distributed versions: the following classifiers are implemented to utilize distributed computing resources through the GridGain framework: decision tree, k-NN and a boosting compound classifier.

- Services for clustering of text documents in distributed versions: algorithms for building clustering models, implemented with GridGain in the same way as the classification models: K-Means and GHSOM [7].

2.2 Process and event log mining

The next subset of services deals with the behavioral analysis of IT portal users (such as e-shop customers, social network users, etc.). The actions of these users are usually mirrored in access and event logs (e.g. access to the IT portal, participation in an email campaign, display of a specific product in an e-shop, etc.). Our services can analyze these logs and extract various types of knowledge (e.g. classification rules, segmentation and clustering based on similar behavior, behavior patterns and recommendations).

The recommender system provides user-specific information that can be used for marketing campaigns, personalized web recommendations, advertisement, etc. In a simple scenario, the system may predict the products that a certain user is likely to buy. In this case, data about a single user (user id, item id, ratings, etc.) are loaded into the web GUI of our recommender system. A set of algorithms analyzes these data and produces a single data file specific to the corresponding user. The file contains information about items that may be interesting for the user. The system utilizes various algorithms, such as Matrix Factorization, Item-Based k-NN, User-Based k-NN, Weighted Regularized Matrix Factorization and Bayesian Personalized Ranking Matrix Factorization. The algorithms follow a collaborative approach in which several models are created. Each individual model is built using a different method and represents a specific personal assessment of the user. The produced models are then tested, compared with each other and finally combined into a single hybrid model, which combines a variety of recommendation techniques with the goal of achieving the best possible performance.
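The idea of combining several collaborative models into one hybrid model can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RapidMiner-based implementation: the toy ratings, the function names, the use of all neighbours instead of the top k, and the fixed equal weights are hypothetical, and the paper does not specify how the individual models are actually combined.

```python
from math import sqrt

# Toy ratings (user -> {item: rating}); all values are made up for illustration.
RATINGS = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5, "D": 2},
    "u3": {"A": 1, "C": 5, "D": 4},
    "u4": {"B": 4, "C": 2},
}

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[k] * b[k] for k in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def item_vectors(ratings):
    """Transpose user -> item ratings into item -> user ratings."""
    items = {}
    for user, row in ratings.items():
        for item, rating in row.items():
            items.setdefault(item, {})[user] = rating
    return items

def predict_item_based(ratings, user, item):
    """Similarity-weighted mean of the user's own ratings of other items."""
    vectors = item_vectors(ratings)
    num = den = 0.0
    for other, rating in ratings[user].items():
        sim = cosine(vectors.get(item, {}), vectors[other])
        num += sim * rating
        den += sim
    return num / den if den else None

def predict_user_based(ratings, user, item):
    """Similarity-weighted mean of other users' ratings of the item."""
    num = den = 0.0
    for other, row in ratings.items():
        if other == user or item not in row:
            continue
        sim = cosine(ratings[user], row)
        num += sim * row[item]
        den += sim
    return num / den if den else None

def predict_hybrid(ratings, user, item, weights=(0.5, 0.5)):
    """Weighted average of whichever model predictions are available."""
    preds = (predict_item_based(ratings, user, item),
             predict_user_based(ratings, user, item))
    pairs = [(w, p) for w, p in zip(weights, preds) if p is not None]
    if not pairs:
        return None
    return sum(w * p for w, p in pairs) / sum(w for w, _ in pairs)

def recommend(ratings, user, top_n=2):
    """Rank the items the user has not rated yet by hybrid score."""
    seen = set(ratings[user])
    candidates = {i for row in ratings.values() for i in row} - seen
    scored = [(predict_hybrid(ratings, user, i), i) for i in candidates]
    scored = [(s, i) for s, i in scored if s is not None]
    return [i for _, i in sorted(scored, key=lambda t: (-t[0], t[1]))[:top_n]]
```

Calling `recommend(RATINGS, "u4")` ranks the two items that user `u4` has not rated yet. In a real setting the weights would be chosen by testing and comparing the individual models, as the text describes, rather than fixed in advance.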
The system is designed and implemented using the RapidMiner analytics platform. An accompanying web GUI is implemented in PHP.

The results of the behavior rule mining service take the form of prediction rules usable for decision-making and support in areas such as management, marketing, customer segmentation, classification, behavior prediction, etc. The service processes user data and event logs by means of data aggregation, clustering, classification and prediction. The aggregations are created using a predefined set of operators (such as count, sum, frequency, etc.), and the results are filtered using hierarchical agglomerative clustering, leaving only the most relevant aggregated data (the predictors). The clustering metric is based on correlation coefficients obtained using either Pearson's product-moment correlation coefficient (for numeric event attributes) or Pearson's chi-square test of independence (for nominal event attributes). Finally, a decision tree is built from the aggregated data and a set of rules is extracted from it. This set is sorted by the number of data examples to which each rule applies correctly, and only the most significant rules are returned.

Both components are implemented as web services communicating via JSON messages.
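The predictor-filtering step of the behavior rule mining service can be illustrated with a small Python sketch for numeric attributes. It is a simplified reading of the description above, not the deployed service: the distance 1 - |r|, single linkage, the threshold of 0.2, keeping the first member of each cluster, and the feature names are all assumptions made for this example, and the chi-square variant for nominal attributes is omitted.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant column has no defined correlation; treat it as uncorrelated.
    return cov / (sx * sy) if sx and sy else 0.0

def cluster_features(columns, threshold=0.2):
    """Single-linkage agglomerative clustering of features with distance 1 - |r|."""
    clusters = [[name] for name in columns]

    def dist(c1, c2):
        return min(1 - abs(pearson(columns[a], columns[b]))
                   for a in c1 for b in c2)

    while len(clusters) > 1:
        pairs = [(dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d > threshold:          # remaining clusters are sufficiently different
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def select_predictors(columns, threshold=0.2):
    """Keep one representative (the first member) from each cluster."""
    return [cluster[0] for cluster in cluster_features(columns, threshold)]

# Hypothetical per-user aggregations produced by count and sum operators.
FEATURES = {
    "visit_count": [1, 2, 3, 4, 5],
    "visit_seconds_sum": [10, 21, 29, 41, 50],
    "purchase_count": [0, 1, 0, 1, 0],
}
```

With these toy values, `visit_count` and `visit_seconds_sum` are almost perfectly correlated and fall into one cluster, so `select_predictors(FEATURES)` keeps only one of them together with `purchase_count`. The surviving predictors would then be passed on to the decision tree.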

2.3 Sentiment and theme analysis

These services provide automatic detection of the themes of text documents with a filtering option, i.e. access to relevant articles only. For search engines, this can be implemented as an extension that makes it possible to search documents by their themes instead of by word matching. In product sales and in discussions about products, theme detection can recognize the main topics that interest customers the most. In the public sector, the detection of document themes can serve as a good tool, e.g. for detecting the main affairs involving politicians and for providing a picture of public figures.

The system for topic modelling [9] is created as a Java library. It is supported by two frameworks, GATE and MALLET. The system is able to process input documents automatically and to display the discovered topics with their descriptions as output. The required number of output topics can be passed as an input parameter by the user or estimated automatically by the system. For topic modelling we used the Latent Dirichlet Allocation (LDA) method [1].

A dictionary-based approach [2] was used for detecting the polarity of discussions. It uses lexicons that contain words useful for classification. A lexicon for opinion classification was created; it contains around 1200 words in the nominative plural. Each word has an assigned strength of polarity. The words are divided into 4 groups (positive, negative, opposite and intensification). These words are then used to classify a text into the positive or the negative class. The algorithm compares the words in the text with the words in the dictionary. The final polarity value of a sentence is computed as the sum of the values of all polarity words in the sentence. The final polarity value of a text depends on the values of its sentences.

3 Conclusion

The aim of our set of analytical services is not to compete with the big analytical platforms supported by the most important vendors and actors in this domain.
We provide it as a customized approach for selected business cases, e.g. for small and medium-sized companies that need an easy-to-use solution without requiring deeper knowledge of the implemented methods or algorithms. On the other hand, the available services can be modified and improved within one's own research activities.

Acknowledgment. The work presented in this paper was partially supported by the Slovak Grant Agency of the Ministry of Education and Academy of Science of the Slovak Republic under grant No. 1/1147/12 (50%) and as the result of the project implementation: University Science Park TECHNICOM for Innovation Applications Supported by Knowledge Technology, ITMS: 26220220182, supported by the Research & Development Operational Programme funded by the ERDF (50%).

References

1. Blei, D. M., Ng, A. Y., Jordan, M. I.: Latent Dirichlet Allocation. In: Journal of Machine Learning Research 3, 2003, pp. 993-1022.

2. Mikula, M., Machová, K.: Classification of opinion in conversational content. In: IEEE SAMI 2015 Proceedings, Herľany, Slovakia, 2015, pp. 227-231.
3. Butka et al.: Distributed task-based execution engine for support of text-mining processes. In: IEEE SAMI 2009 Proceedings, Herľany, Slovakia, 2009, pp. 29-34.
4. Bednár, P., Butka, P.: Task-based execution engine for JBOWL. In: WIKT 2008 Proceedings, Smolenice, Bratislava, STU, 2009, pp. 65-68.
5. Bednár, P., Butka, P., Paralič, J.: Java library for support of text mining and retrieval. In: Znalosti 2005, Stará Lesná, VŠB-TU Ostrava, 2005, pp. 162-169.
6. Bednár, P., Sarnovský, M., Demko, V.: RDF vs. NoSQL databases for the Semantic Web applications. In: IEEE SAMI 2014 Proceedings, Herľany, Slovakia, 2014, pp. 361-364.
7. Sarnovský, M.: Design and implementation of Interactive visualization of GHSOM clustering algorithm for text mining tasks. In: International Journal of Research in Information Technology, Vol. 2, No. 7 (2014), pp. 146-151.
8. Sarnovský, M.: Design and implementation of the cloud based application for text mining tasks. In: Data Mining and Knowledge Engineering, Vol. 6, No. 6 (2014), pp. 261-264.
9. Smatana, M. et al.: Active learning enhanced semi-automatic annotation tool for aspect-based sentiment analysis. In: IEEE SISY 2013 Proceedings, Subotica, Serbia, 2013, pp. 191-194.
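As a supplementary illustration of the dictionary-based polarity detection described in Section 2.3, the following Python sketch sums the values of polarity words per sentence and derives a text label from the sentence scores. It is a minimal sketch under assumptions: the tiny English lexicon, the word values, the rule that an "opposite" word flips and an intensifier multiplies the next polarity word, and the extra "neutral" label for a zero score are all hypothetical and are not taken from the authors' lexicon or algorithm.

```python
import re

# Hypothetical toy lexicon with the four word groups named in Section 2.3.
POSITIVE = {"good": 1.0, "great": 2.0, "excellent": 3.0}
NEGATIVE = {"bad": -1.0, "poor": -2.0, "terrible": -3.0}
OPPOSITE = {"not", "never"}                       # assumed to flip the next polarity word
INTENSIFIERS = {"very": 2.0, "extremely": 3.0}    # assumed to scale the next polarity word

def sentence_polarity(sentence):
    """Sum of the (possibly flipped or intensified) values of all polarity words."""
    score = 0.0
    negate = False
    boost = 1.0
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word in OPPOSITE:
            negate = True
        elif word in INTENSIFIERS:
            boost *= INTENSIFIERS[word]
        else:
            value = POSITIVE.get(word, NEGATIVE.get(word))
            if value is None:
                continue                          # word is not in the lexicon
            value *= boost
            if negate:
                value = -value
            score += value
            negate, boost = False, 1.0            # modifiers apply to one polarity word
    return score

def text_polarity(text):
    """Classify a text by the sum of its sentence polarity values."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = sum(sentence_polarity(s) for s in sentences)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"                              # assumption: the paper names only two classes
```

For example, `sentence_polarity("The screen is not good")` flips the value of "good", and `text_polarity` aggregates such sentence scores by simple summation, which is only one of several plausible ways in which a text value could depend on its sentences.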