IT services for analyses of various data samples


Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky

Technical University of Košice, Faculty of Electrical Engineering and Informatics, Department of Cybernetics and Artificial Intelligence, Letná 9/B, Košice, Slovakia
{jan.paralic, frantisek.babic, martin.sarnovsky, peter.butka, cecilia.havrilova, miroslava.muchova, michal.puheim, martin.mikula,

Abstract. Efficient processing and analysis of various data samples is becoming an important means of gaining a competitive advantage in the market. In this situation, analytical services available through the cloud are an attractive way to offer a variety of methods and algorithms in an easy-to-use form. The set of services described in this paper was designed and implemented on the basis of multi-year research and project activities in domains such as text mining, distributed and parallel computing, sentiment analysis, topic modelling and data science in general. We present three main subsets of services with a short description of the approaches and technologies used. The implemented methods and algorithms have been continuously tested and deployed within previous national and EU projects, dissertations, master's theses, etc.

Keywords: data, analysis, services

1 Introduction

An important condition for the proper functioning and efficient performance of the presented services is a technical infrastructure that provides the necessary computing power and data capacity. We continuously build our own computing environment, in which we can not only deploy and test our services but also offer them as SaaS (Software as a Service) to other potential users.

Some basic characteristics of the proposed architecture are the following:

- Private cloud managed with CoreOS, a lightweight Linux distribution that focuses on managing Linux containers.
- Web services decoupled into containers using Docker, software for automating the deployment of applications into Linux containers. The operating system and the application are combined into a software container, which can then be launched inside virtualization software.
- REST-like web services with a dedicated web portal for user interaction. Programmatic calls are handled through the REST-like API, the current de facto standard for web services, which provides a simpler alternative to the SOAP and WS-* standards. User interactions are conducted through a web application that itself uses the same web services in the background.
- Programming languages such as Java, C#, Python and R for business logic and analytics.

2 Analytical platform

In this section, each subset of services is described in more detail:

- Effective management, storage and analysis of large collections of text documents using a sufficiently powerful computing platform of a private cloud infrastructure.
- Analysis of transaction data from electronic shops in order to provide recommendations for customers based on their buying behavior and/or the buying behavior of previous customers with similar characteristics.
- Analysis of textual data, e.g. data from web discussions, to identify overall customer satisfaction with given products, or to identify the major topics occurring in a given collection of textual data as well as the sentiment of their authors about particular topics.

2.1 Big text data analysis

Services for big text data analysis are designed and implemented in line with current state-of-the-art technologies and frameworks, including newly designed and implemented methods and algorithms, and are accessible through the web portal or the API [8]. The backend implementation is based on JBOWL, an internally developed Java library for text mining tasks, and on supporting services (i.e. indexing and preprocessing) [5]. Particular methods were re-implemented in distributed versions using the GridGain API [3, 4]. GridGain is a framework for the development of distributed applications, including real-time big data analytical applications. The user interface is implemented using JSP (Java Server Pages), and interactive visualizations of the models are implemented in the Processing framework.

We offer the following:

- Services for management and manipulation of text document collections: services for dataset manipulation, including dataset management.
- Services for indexing, complex statistical text analyses and preprocessing tasks: services for preprocessing of text documents, including methods such as stopword removal, stemming and the computation of several term-weighting schemes.
- Services for building classification models, implemented in distributed versions: algorithms for classification model building. The following classifiers are implemented to utilize distributed computing resources through the GridGain framework: decision tree, k-NN and a boosting compound classifier.

- Services for clustering of text documents in distributed versions: algorithms for building clustering models which, like the classification models, are implemented using GridGain: K-Means and GHSOM [7].

2.2 Process and event log mining

The next subset of services deals with the behavioral analysis of IT portal users (such as e-shop customers, social network users, etc.). The actions of these users are usually mirrored in access and event logs (e.g. access to the IT portal, participation in a campaign, display of a specific product in an e-shop, etc.). Our services can analyze these logs and extract various types of knowledge (e.g. classification rules, segmentation and clustering based on similar behavior, behavior patterns and recommendations).

The recommender system provides user-specific information which can be used for marketing campaigns, personalized web recommendations, advertisements, etc. In a simple scenario, the system may predict the products which a certain user is likely to buy. In this case, data about a single user (user id, item id, ratings, etc.) are loaded into the web GUI of our recommender system. A set of algorithms analyzes these data and produces a single data file specific to the corresponding user. The file contains information about items which may be of interest to the user. The system utilizes various algorithms, such as Matrix Factorization, item-based k-NN, user-based k-NN, Weighted Regularized Matrix Factorization and Bayesian Personalized Ranking Matrix Factorization. The algorithms follow a collaborative approach in which several models are created. Each individual model is built using a different method and represents a specific personal assessment of the user. The produced models are then tested, compared with each other and finally combined into a single hybrid model, which combines a variety of recommendation techniques with the goal of achieving the best possible performance.
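The text does not specify how the individual models are combined into the hybrid model. The following is a minimal sketch, assuming a weighted average of min-max-normalized scores; the function names, model names, weights and scores are hypothetical and are not taken from the RapidMiner implementation.

```python
def min_max_normalize(scores):
    """Rescale one model's item scores to [0, 1] so that models are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All items scored equally: treat them as neutral.
        return {item: 0.5 for item in scores}
    return {item: (s - lo) / (hi - lo) for item, s in scores.items()}


def hybrid_recommend(model_scores, weights, top_n=3):
    """Combine per-model item scores into one ranking (assumed weighted average).

    model_scores: {model_name: {item_id: predicted_score}}
    weights:      {model_name: weight}, e.g. derived from each model's test accuracy
    An item that a model did not score contributes 0 for that model.
    """
    combined = {}
    for name, scores in model_scores.items():
        for item, s in min_max_normalize(scores).items():
            combined[item] = combined.get(item, 0.0) + weights[name] * s
    total_weight = sum(weights[name] for name in model_scores)
    # Highest combined score first; ties are broken by item id.
    ranked = sorted(combined, key=lambda i: (-combined[i] / total_weight, i))
    return ranked[:top_n]


# Hypothetical predicted ratings for one user from two of the models.
model_scores = {
    "item_based_knn": {"A": 4.0, "B": 2.0, "C": 3.0},
    "matrix_factorization": {"A": 1.0, "B": 5.0, "C": 3.0},
}
weights = {"item_based_knn": 0.7, "matrix_factorization": 0.3}

top_items = hybrid_recommend(model_scores, weights, top_n=2)
print(top_items)
```

Normalizing each model's scores before averaging keeps a model with a wider score range from dominating the ranking; other combination schemes (e.g. stacking or switching between models) would fit the description equally well.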
The system is designed and implemented using the RapidMiner analytics platform. An accompanying web GUI is implemented in PHP.

The results of the behavior rule mining service take the form of prediction rules usable for decision-making and decision support in areas such as management, marketing, customer segmentation, classification, behavior prediction, etc. The service processes user data and event logs by means of data aggregation, clustering, classification and prediction. The aggregations are created using a predefined set of operators (such as count, sum, frequency, etc.), and the results are filtered using Hierarchical Agglomerative Clustering, leaving only the most relevant aggregated data (the predictors). The metric used for clustering is based on correlation coefficients obtained using either Pearson's product-moment correlation coefficient (for numeric event attributes) or Pearson's chi-square test of independence (for nominal event attributes). Finally, a decision tree is created from the aggregated data and a set of rules is extracted from it. This set is sorted according to the number of data examples to which a single rule applies correctly, and only the most significant rules are returned.

Both components are implemented as web services communicating via JSON messages.
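The predictor-filtering step of the behavior rule mining service can be illustrated with a small sketch. It covers only numeric attributes (the chi-square variant for nominal attributes is omitted) and rests on several assumptions that the text does not confirm: the distance is taken as 1 - |r|, single linkage is used, merging stops at a hypothetical threshold `max_distance`, and the first feature of each cluster is kept as its representative predictor. The feature names and values are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's product-moment correlation of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_features(features, max_distance=0.2):
    """Agglomerative clustering of features with distance 1 - |r| (single linkage)."""
    clusters = [[name] for name in features]

    def dist(c1, c2):
        return min(1 - abs(pearson(features[a], features[b]))
                   for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        if dist(clusters[i], clusters[j]) > max_distance:
            break  # remaining clusters are not strongly correlated
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters


# Hypothetical aggregated event-log features, one value per user.
features = {
    "visit_count": [1, 2, 3, 4, 5],
    "page_views": [2, 4, 6, 8, 11],    # almost perfectly correlated with visit_count
    "purchase_sum": [5, 1, 4, 2, 3],
}

clusters = cluster_features(features)
# Keep one representative per cluster as a predictor (an assumption).
predictors = [cluster[0] for cluster in clusters]
print(clusters)
print(predictors)
```

Grouping strongly correlated aggregations and keeping one per group removes redundant inputs before the decision tree is built, which is one plausible reading of "leaving only the most relevant aggregated data".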

2.3 Sentiment and theme analysis

These services automatically detect the themes of text documents and offer a filtering option, i.e. access to relevant articles only. In search engines, they can be deployed as an extension that allows documents to be searched by theme rather than by word matching. In product sales and in discussions about products, theme detection can recognize the main topics that interest customers most. In the public sector, the detection of document themes can serve as a tool for, e.g., detecting the main affairs involving politicians and reflecting how public figures are perceived.

The system for topic modelling [9] is created as a Java library and is supported by two frameworks, GATE and MALLET. It processes input documents automatically and outputs the discovered topics together with their descriptions. The required number of output topics can be passed as an input parameter by the user or estimated automatically by the system. For topic modelling we used the Latent Dirichlet Allocation (LDA) method.

A dictionary-based approach [2] was used for detecting the polarity of discussions. It uses lexicons which contain words useful for classification. A lexicon for opinion classification was created, containing around 1200 words in the nominative plural. Each word has an assigned strength of polarity, and the words are divided into 4 groups (positive, negative, opposite and intensification). These words are then used to classify a text into the positive or the negative class. The algorithm compares the words in the text with the words in the dictionary. The final polarity value of a sentence is computed as the sum of the values of all polarity words in that sentence. The final polarity value of the text depends on the values of its sentences.

3 Conclusion

The aim of our set of analytical services is not to compete with the big analytical platforms supported by the most important vendors and actors in this domain. We offer it as a customized approach for selected business cases, e.g. for small and medium-sized companies that need an easy-to-use solution that does not require deeper knowledge of the implemented methods or algorithms. On the other hand, the available services can be modified and improved within one's own research activities.

Acknowledgment. The work presented in this paper was partially supported by the Slovak Grant Agency of Ministry of Education and Academy of Science of the Slovak Republic under grant No. 1/1147/12 (50%) and as the result of the Project implementation: University Science Park TECHNICOM for Innovation Applications Supported by Knowledge Technology, ITMS: , supported by the Research & Development Operational Programme funded by the ERDF (50%).

References

1. Blei, D. M., Ng, A. Y., Jordan, M. I.: Latent Dirichlet Allocation. In: Journal of Machine Learning Research 3, 2003, pp

2. Mikula, M., Machová, K.: Classification of opinion in conversational content. In: IEEE SAMI 2015 Proceedings, Herľany, Slovensko, 2015, pp
3. Butka et al.: Distributed task-based execution engine for support of text-mining processes. In: IEEE SAMI 2009 Proceedings, Herľany, Slovensko, 2009, pp
4. Bednár, P., Butka, P.: Task-based execution engine for JBOWL. In: WIKT 2008 Proceedings, Smolenice, Bratislava, STU, 2009, pp
5. Bednár, P., Butka, P., Paralič, J.: Java library for support of text mining and retrieval. In: Znalosti 2005, Stará Lesná, VŠB-TU Ostrava, 2005, pp
6. Bednár, P., Sarnovský, M., Demko, V.: RDF vs. NoSQL databases for the Semantic Web applications. In: IEEE SAMI 2014 Proceedings, Herľany, Slovensko, 2014, pp
7. Sarnovský, M.: Design and implementation of Interactive visualization of GHSOM clustering algorithm for text mining tasks. In: International Journal of Research in Information Technology, Vol. 2, No. 7 (2014), pp
8. Sarnovský, M.: Design and implementation of the cloud based application for text mining tasks. In: Data Mining and Knowledge Engineering, Vol. 6, No. 6 (2014), pp
9. Smatana, M. et al.: Active learning enhanced semi-automatic annotation tool for aspect-based sentiment analysis. In: IEEE SISY 2013 Proceedings, Subotica, Serbia, 2013, pp
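As an illustration of the dictionary-based polarity detection described in Section 2.3, the following minimal sketch sums the values of polarity words per sentence and then over the text. Everything concrete in it is an assumption: the tiny English lexicon and its strengths are hypothetical stand-ins for the real lexicon of around 1200 words, the "opposite" group is interpreted as negation that flips the sign of the next polarity word, intensifiers multiply the next polarity word, the text value is taken as the sum of the sentence values, and a "neutral" outcome is added for a zero sum.

```python
import re

# Hypothetical miniature lexicon; the four groups follow the description in the text.
POLARITY = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not", "never"}                        # "opposite" group (assumed meaning)
INTENSIFIERS = {"very": 2.0, "extremely": 3.0}      # "intensification" group


def sentence_polarity(sentence):
    """Sum the values of all polarity words in one sentence.

    A negation or intensifier modifies the next polarity word in the sentence.
    """
    total = 0.0
    sign, factor = 1.0, 1.0
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word in NEGATIONS:
            sign = -sign
        elif word in INTENSIFIERS:
            factor *= INTENSIFIERS[word]
        elif word in POLARITY:
            total += sign * factor * POLARITY[word]
            sign, factor = 1.0, 1.0  # modifiers are consumed by this word
    return total


def text_polarity(text):
    """Classify a text from the sum of its sentence values (assumed aggregation)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = sum(sentence_polarity(s) for s in sentences)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


review = "The screen is very good. The battery is not good."
label = text_polarity(review)
print(label)
```

In this example the first sentence scores 2.0 and the second -1.0, so the review is labelled positive. A real deployment would also need morphological normalization so that inflected word forms match the lexicon entries.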