SMTP: Stedelijk Museum Text Mining Project



Similar documents
Data Mining Yelp Data - Predicting rating stars from review text

Healthcare Measurement Analysis Using Data mining Techniques

ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013

Apigee Insights Increase marketing effectiveness and customer satisfaction with API-driven adaptive apps

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April ISSN

Volume 2, Issue 11, November 2014 International Journal of Advance Research in Computer Science and Management Studies

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Digital Collections as Big Data. Leslie Johnston, Library of Congress Digital Preservation 2012

Big Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning

Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies

Selection of Optimal Discount of Retail Assortments with Data Mining Approach

Feature. Applications of Business Process Analytics and Mining for Internal Control. World

Neural Networks for Sentiment Detection in Financial Text

Complex Network Visualization based on Voronoi Diagram and Smoothed-particle Hydrodynamics

Community Detection Proseminar - Elementary Data Mining Techniques by Simon Grätzer

ADVANCED DATA VISUALIZATION

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.

TECHNOLOGY ANALYSIS FOR INTERNET OF THINGS USING BIG DATA LEARNING

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

A GENERAL TAXONOMY FOR VISUALIZATION OF PREDICTIVE SOCIAL MEDIA ANALYTICS

A Framework of User-Driven Data Analytics in the Cloud for Course Management

Research of Postal Data mining system based on big data

Big Data Analytics in Mobile Environments

Degree in Art and Design

Fogbeam Vision Series - The Modern Intranet

TEXT ANALYTICS INTEGRATION

A Visualization is Worth a Thousand Tables: How IBM Business Analytics Lets Users See Big Data

Connecting the Dots in Visual Analysis

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

GEO-VISUALIZATION SUPPORT FOR MULTIDIMENSIONAL CLUSTERING

Augmented Search for Web Applications. New frontier in big log data analysis and application intelligence

Using Text and Data Mining Techniques to extract Stock Market Sentiment from Live News Streams

Delivering Smart Answers!

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Text Mining - Scope and Applications

Most Effective Communication Management Techniques for Geographically Distributed Project Team Members

Big Data Analytics. Bringing Value out of Volume

Visualization methods for patent data

Term extraction for user profiling: evaluation by the user

Model-Based Cluster Analysis for Web Users Sessions

Estimation of Human Mobility Patterns and Attributes Analyzing Anonymized Mobile Phone CDR:

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

A Survey on Web Research for Data Mining

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

Better planning and forecasting with IBM Predictive Analytics

Visual Analysis Tool for Bipartite Networks

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

A Web-based Interactive Data Visualization System for Outlier Subspace Analysis

Big Data Text Mining and Visualization. Anton Heijs

Online Credit Card Application and Identity Crime Detection

Overview Applications of Data Mining In Health Care: The Case Study of Arusha Region

A Big Data Analytical Framework For Portfolio Optimization Abstract. Keywords. 1. Introduction

Introduction to Data Mining

Introduction. A. Bellaachia Page: 1

Towards Inferring Web Page Relevance An Eye-Tracking Study

Clustering Data Streams

Using Feedback Tags and Sentiment Analysis to Generate Sharable Learning Resources

Research Article EFFICIENT TECHNIQUES TO DEAL WITH BIG DATA CLASSIFICATION PROBLEMS G.Somasekhar 1 *, Dr. K.

Computer Forensics Application. ebay-uab Collaborative Research: Product Image Analysis for Authorship Identification

A Comparative Study on Sentiment Classification and Ranking on Product Reviews

A Survey on Web Mining From Web Server Log

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

Cross-Domain Collaborative Recommendation in a Cold-Start Context: The Impact of User Profile Size on the Quality of Recommendation

Business Challenges and Research Directions of Management Analytics in the Big Data Era

Federico Rajola. Customer Relationship. Management in the. Financial Industry. Organizational Processes and. Technology Innovation.

An Introduction to Data Mining

Random Forest Based Imbalanced Data Cleaning and Classification

A Knowledge Management Framework Using Business Intelligence Solutions

Computer-Based Text- and Data Analysis Technologies and Applications. Mark Cieliebak

Geo Data Mining and Visual Analytics

2.0 COMMON FORMS OF DATA VISUALIZATION

WHITE PAPER. Five Steps to Better Application Monitoring and Troubleshooting

Driving Innovation in Licensing Through Competitive Intelligence and Big Data Analytics

Hexaware E-book on Predictive Analytics

Data Mining and Analytics in Realizeit

ASSOCIATION RULE MINING ON WEB LOGS FOR EXTRACTING INTERESTING PATTERNS THROUGH WEKA TOOL

DEMYSTIFYING BIG DATA. What it is, what it isn t, and what it can do for you.

IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION

20 A Visualization Framework For Discovering Prepaid Mobile Subscriber Usage Patterns

CAPTURING THE VALUE OF UNSTRUCTURED DATA: INTRODUCTION TO TEXT MINING

Edifice an Educational Framework using Educational Data Mining and Visual Analytics

Text Analytics. A business guide

Web Log Based Analysis of User s Browsing Behavior

BUDT 758B-0501: Big Data Analytics (Fall 2015) Decisions, Operations & Information Technologies Robert H. Smith School of Business

IMPROVING BUSINESS PROCESS MODELING USING RECOMMENDATION METHOD

Transcription:

SMTP: Stedelijk Museum Text Mining Project Jeroen Smeets Maastricht University smeetsjeroen@hotmail.com Prof. Dr. Ir. Johannes C. Scholtes Maastricht University j.scholtes@maastrichtuniversity.nl Dr. Claartje Rasterhoff CREATE University of Amsterdam C.Rasterhoff@uva.nl Dr. Margriet Schavemaker Stedelijk Museum Amsterdam M.Schavemaker@stedelijk.nl Introduction This paper addresses how text-mining, machine-learning and information retrieval algorithms from the field of artificial intelligence can be used to analyze Art-Research archives and conduct (art-) historical research. Two aspects are to focus on in order to gain quick insight into the archive: relations between groups of people using community detection, and global content changes over time using topic modeling. To develop and test the validity and relevance of existing tools, close collaboration was established between the AI researchers, museum staff, and researchers in CREATE, a digital humanities project that investigates the development of cultural industries in Amsterdam over the course of the last five centuries. Data The research draws on two datasets. The principal dataset is the digitized archive of the Stedelijk Museum Amsterdam, a renowned international museum dedicated to modern and contemporary art and design. The archive of the Stedelijk Museum Amsterdam is a static collection of approximately 160.000 OCR digitized documents from the period 1930-1980. The second dataset is drawn from Delpher (Koninklijke Bibliotheek Nederland, 2015), which provides a collection of digitized newspapers, books and magazines that is available for research. A selection of newspapers was made that is used as an additional dataset for this project.

Methodology The methodology uses two approaches to obtain a quick and detailed overview of the content of a digitized archive that contains unstructured information. Relation networks and Community Detection The first approach focuses on the relations between named entities and aims at finding communities in the relation network. In its most basic form, a relation between two named entities can be said to exist when both occur in the same document. The relation is characterized by its strength, the number of documents in which both named entities occur, and by the sentiment content of the documents. The hypothesis is that relations between individuals with a high sentiment are more interesting than relations with a low sentiment. This is because sentiments around trigger-events are often higher than around common-day events. Community detection algorithms (Fortunato, 2010) are applied to the relation network, shown in Figure 1. Communities such as group exhibitions, art movements or a group of artists closely related to the museum director, could be identified with the help of a museum expert. Figure 1: Found communities for graphic artists in the archive of the Stedelijk Museum Name Extraction In order to build the relation network, a name extraction method is developed that is able to handle multiple causes of name variations such as OCR induced errors, spelling mistakes, name abbreviations and first and last name combinations. The method makes use of lists of name

variations that are extracted from the document collection, relying on a name database such as RKDartists& (RKD, 2015). A similarity score is calculated between the original name and the possible name variation, based on an n-gram set matching technique described in (Song and Chen, 2007). Using a threshold of 0.9 on the similarity score, the method was tested on 50 randomly chosen names, resulting in an average precision of 81 percent. Time based Topic Modeling The information content and its evolution over time is analyzed using topic modeling. A time-based collective matrix factorization technique is used, as suggested in (Vaca et al., 2014). It is based on Non-Negative Matrix Factorization (NMF) (Arora et al., 2012), extended by introducing a topic transition matrix that allows to track topics as they emerge, evolve and fade over time. The algorithm was applied to both the archive of the Stedelijk Museum Amsterdam and newspaper articles from the Delpher database. The results are visualized over time in the form of stacked topic rivers (Wei et al., 2010), shown in Figure 2 and Figure 3. Several exhibitions and events could be identified and are annotated on the chart. Figure 2: Time based topic modeling for the archive of the Stedelijk Museum Amsterdam

Figure 3: Time based topic modeling for Delpher newspaper articles Conclusion For the humanities researchers in this project, the main aim was to assess the research potential of computational analysis of digitized art archives in general, and the Stedelijk Museum in particular. Community detection, relying on a robust name extraction method, together with time based topic modeling were applied to the archives. The results demonstrate how AI tools may uncover unexpected relationships between people, artworks and organizations, as well as changing patterns over time. A second possible application can be found in connecting previously isolated content with other relevant digital sources, such as RKDartists& and Delpher. The developed techniques therefore enable unstructured art historical archives to speak up in unprecedented ways even if the results are not clean at the first try and do not capture all historical nuance. By allowing the archive to open up, the proposed approaches offer ways to reveals hidden story lines that subvert and augment prevailing historical narratives. References Arora, S., Ge, R. and Moitra, A. (2012). Learning topic models--going beyond SVD. Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on. IEEE, pp. 1 10. Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3): 75 174. Koninklijke Bibliotheek Nederland (2015). Delpher - Boeken Kranten Tijdschriften http://www.delpher.nl/ (accessed 1 November 2015). RKD (2015). Netherlands Institute for Art History https://rkd.nl/en/ (accessed 1 November 2015). Song, S. and Chen, L. (2007). Similarity joins of text with incomplete information formats. Advances in Databases: Concepts, Systems and Applications. Springer, pp. 313 24.

Vaca, C. K., Mantrach, A., Jaimes, A. and Saerens, M. (2014). A time-based collective factorization for topic discovery and monitoring in news. Proceedings of the 23rd International Conference on World Wide Web. ACM, pp. 527 38. Wei, F., Liu, S., Song, Y., Pan, S., Zhou, M. X., Qian, W., Shi, L., Tan, L. and Zhang, Q. (2010). Tiara: a visual exploratory text analytic system. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 153 62.