Mobile Social Media Mining Challenges Overview: A Case Study of WeChat


Thomas E. Epalle
School of Mathematics, Physics and Information Engineering
Zhejiang Normal University, Jinhua, China
epallethomas@yahoo.fr

Abstract

Data mining is a particularly valuable tool when applied to social media. Social media mining is the process of representing, analyzing and extracting useful patterns from the data that social interactions produce in social media. Since its release in 2011, WeChat has quickly become the fastest-growing mobile networking app in China. People of similar interests and backgrounds meet and cooperate through this social network, which lets them share information flexibly and globally. Today, just like Facebook and Twitter, WeChat holds millions of unprocessed raw data records. Information mined from this mobile social medium can considerably influence the strategy of a business enterprise. For the general public, however, mining WeChat remains a very complex task, unlike mining Facebook or Twitter. This paper identifies and analyzes some of the challenges engineers and researchers may face when they try to mine the rich user-generated content held in the WeChat database.

Keywords: Mobile Social Media; WeChat; Challenges; Data Mining; Opinion Mining

1 Introduction

This era is known as the age of big data. Hundreds of millions of people all over the world spend countless hours connecting, interacting, communicating and sharing data through social media. Social media is now one of the biggest repositories of big data. With this big data comes an unprecedented opportunity for data mining research. Mining data in social media is the process of collecting, searching and analyzing the structure of social media, as well as the large amount of user-generated data, so as to discover useful patterns and relationships.
This new research field has attracted researchers from different backgrounds such as computer science, social sciences, data mining, machine learning, natural language processing, text mining, social media analysis and information retrieval. Because social media data differs from the data used in classical data mining, new methods are needed to explore and analyze this unparalleled source of data.

Many definitions of social media have been proposed in the literature[1][2][3]. Some are more abstract than others, like the one proposed by the authors of [2], who define social media as a set whose elements are social atoms (individuals), entities and interactions. Kaplan and Haenlein propose a more concise definition. They consider social media to be the group of internet-based applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchange of user-generated content[4].

There are many types of social media, and their classification and characterization vary from author to author in the rich social media literature[1][2]. Kaplan and Haenlein [4] classify social media sites into blogs, content communities, collaborative projects, virtual social worlds, virtual game worlds and social networking sites. The authors of [2] give another interesting classification of social media.
The table below summarizes their classification and lists some real examples:

Type of social media    Examples
Social networking       Facebook, LinkedIn
Microblogging           Twitter, Weibo
Photo sharing           Flickr, Photobucket, Picasa
News aggregation        Google Reader, StumbleUpon, Feedburner
Video sharing           Youtube, Youku, MetaCafe
Livecasting             Ustream, Justin.TV
Virtual worlds          Kaneva
Social gaming           World of Warcraft
Social search           Google, Bing, Ask
Instant messaging       Google talk, Skype, Yahoo!Messenger, WeChat, QQ, Bigo, Whatsapp

Table 1: Types of social media

Mobile social media are those that provide an app usable both on a computer and on Android, iPhone, BlackBerry, Windows Phone and Symbian devices.

In this paper we consider the challenges that data mining researchers and engineers have to overcome in order to mine the vast amount of user-generated data held in mobile social media. WeChat is used as a case study, under the hypothesis that its mining challenges generalize readily to other mobile social media apps.

2 What is WeChat?

WeChat (微信, Weixin), which literally means "micro message", is the most widely used standalone mobile voice and text messaging app in China. It was first released in January 2011 by Tencent. By August 2014 WeChat already had 438 000 000 active users, 70 000 000 of them outside China. The WeChat app is available for Android, iPhone, BlackBerry, Windows Phone and Symbian phones. There are also web-based clients, but the user must have the app installed on a supported mobile phone in order to be identified by scanning a QR code. Registration is done with a phone number or with a Facebook account.

The app provides text messaging, hold-to-talk voice messaging, broadcast messaging, sharing of pictures and videos, and location sharing. It can exchange contacts with people nearby via Bluetooth, and it offers various features for contacting people at random if desired (provided these people are open to it), as well as integration with social networking services such as those run by Facebook and Tencent QQ. Photographs can be embellished with filters and captions, and a machine translation service is also available.

WeChat is therefore a rich data source for data mining engineering and research. The network structure, relationships, groups, communities and user-generated content (text, video files, audio files and pictures) can all be used for mining. The scope of this work is limited to network structure mining and text mining (facts as well as opinions).
The relevance of studying WeChat mining challenges as an instance of mobile social media challenges lies in the fact that this app shares the characteristics common to other mobile social networking and instant messaging services: identity, conversation, sharing, presence, relationships and groups.

3 WeChat mining challenges

Unlike Twitter and Facebook mining, on which much research and technical work has focused in the last few years[5], WeChat mining has received very little attention from the research community. Although mining WeChat holds a lot of promise, access to this rich mine of data is obstructed by several challenges, which we address in the remainder of this paper.

3.1 Data extraction and pattern evaluation challenges

Social media mining is an emerging field which has more problems than ready solutions. One of the basic steps in social media mining is data extraction. The most commonly used method to collect data from social media is via application programming interfaces (APIs). The API available for WeChat can be obtained at http://developers.wechat.com/wechatapi. It allows a programmer to develop an app that can send messages to WeChat users, either to their message boxes or to their WeChat Moments. This API has two versions: an iPhone version and an Android version. For now there is no publicly available API for collecting user-generated data from WeChat. This lack of an open API may be the most challenging issue for anyone who wants to mine WeChat. One way to overcome it is to develop an API for this purpose. Programmers interested in doing so should learn how to use the WeChat SDK platform and the Xcode environment in order to provide a free API to the general public. The new API should allow users to transfer data from remote WeChat database servers to a computer for mining purposes.
The present limited API works only on mobile devices, whose computational power is too low to run mining algorithms on big data. Other difficulties may still arise from internet legal issues in China (see section 3.5). Even if this major API issue were solved, other data extraction challenges would remain:

- The big data paradox: while mobile social media data is undoubtedly big, the information available about any particular entity can be very small.
- Sampling problems: without knowing the statistical distribution of the WeChat population in advance, how can we be sure that a sample is representative of the global data? It consequently becomes difficult to make sure that the findings obtained from mining indicate true patterns that are useful for business development or research.
- Mining results evaluation challenges: in classical data mining, data sets are divided into training and test sets. In the case of social media, how can one construct these training and test sets? Evaluating patterns in mobile social media mining seems to be nearly insurmountable.
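
The sampling and evaluation questions above can be made concrete with a small sketch. It assumes, hypothetically, that messages arrive as a stream whose total size is not known in advance; the names `reservoir_sample` and `split_train_test` and the generated placeholder messages are illustrative and are not part of any WeChat API. Reservoir sampling keeps a uniform random sample of such a stream, and the sample is then divided into training and test sets. The sketch does not remove any bias in how the stream itself was collected, so representativeness of the whole population remains an open problem.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the items and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stream of collected messages (placeholder data only).
messages = (f"message {i}" for i in range(10_000))
sample = reservoir_sample(messages, k=100)
train, test = split_train_test(sample, test_fraction=0.2)
print(len(train), len(test))  # 80 20
```

A fixed seed is used only to make the sketch reproducible; in practice the labels needed for the test set would still have to be produced by hand.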

3.2 Mobile social media analysis challenges

Mobile social media have practically the same structure as classical social media. Being networks in which individuals are connected through particular links, they are generally modeled with graphs whose nodes are people and whose edges represent the links between people. Many graph models for social media analysis have been proposed[2], among which the small-world model is the most used. Mining social media structure is an important aspect of social media mining. Unfortunately, the methods generally used to carry out data mining are poorly adapted to the graph structure of social networks. The size and dynamic properties of these graphs make standard data mining algorithms ineffective. Let us illustrate this point by considering some group detection algorithms for social media:

- Graph partitioning by minimum cuts[6]: this algorithm can only be effective if the size of the graph is known in advance.
- Hierarchical clustering[7]: this method has a high time cost (O(n^2 log n)) and is not adapted to structures that are not fundamentally hierarchical.
- K-means clustering[8]: as with other graph partitioning algorithms, the number of partitions must be set in advance.
- Spectral clustering[9]: this algorithm uses matrix representations of the graph and their eigenvalues to define clusters. In this case also, the size of the graph must be constant.

Moreover, the dynamic properties of social networking graphs have not yet been addressed in depth in the literature.

3.3 Natural language processing challenges

Text messages, or tweets, are the most common communication method used by WeChat users in Taiwan, India, Hong Kong, the USA, Thailand, Indonesia and, of course, mainland China. With more than 70 million users outside mainland China, WeChat tweets are undoubtedly multilingual.
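
Before a language identifier is applied to short multilingual messages, a common preprocessing step is to remove tokens that carry no language signal. The sketch below is an illustration only, using the Python standard library; the function names and the sample messages are hypothetical. It strips URLs, @mentions, hashtags and standalone numbers, then guesses the dominant script from Unicode character names. It is a crude heuristic and not a language identifier: it cannot separate simplified from traditional Chinese, or English from French.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
NUMBER_RE = re.compile(r"\b\d+\b")


def strip_language_independent_tokens(text):
    """Remove URLs, @mentions, hashtags and standalone numbers from a message."""
    for pattern in (URL_RE, MENTION_OR_HASHTAG_RE, NUMBER_RE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removed tokens.
    return " ".join(text.split())


def dominant_script(text):
    """Return 'CJK', 'LATIN', 'OTHER' or 'UNKNOWN' based on letter counts."""
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue  # skip punctuation, digits, spaces and most emoticons
        name = unicodedata.name(ch, "")
        if name.startswith("CJK"):
            script = "CJK"
        elif name.startswith("LATIN"):
            script = "LATIN"
        else:
            script = "OTHER"
        counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "UNKNOWN"


# Hypothetical example messages.
clean_zh = strip_language_independent_tokens(
    "@alice 今天天气很好 http://example.com/x #weather 2015"
)
clean_en = strip_language_independent_tokens("See you at 8 @bob #party")
print(clean_zh, dominant_script(clean_zh))
print(clean_en, dominant_script(clean_en))
```

A script guess of this kind can only route a message to a more specific identifier; telling apart languages that share a script needs a trained model.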
WeChat text messages use many languages, including simplified and traditional Chinese, English, French and Spanish. Language identification in tweets is a particular problem because of their short length and their use of language-independent tokens: hashtags, @mentions, numbers, URLs and emoticons. These are new challenges for text mining and natural language processing. Proper language identification tools should therefore be employed when mining textual data from WeChat.

3.4 Opinion mining and sentiment analysis challenges

One of the most interesting areas of social media mining is opinion mining, or sentiment analysis. The expressions "sentiment analysis" and "opinion mining" are generally used synonymously in the literature[10]. This field of study analyzes people's opinions, sentiments, evaluations, appraisals, attitudes and emotions towards entities such as products, services, organizations, individuals, issues, events and topics, and towards their attributes. In recent years a large amount of research has focused on opinion mining from the web in general, for example in discussion forums, review sites, blogs and e-commerce sites[11][1][12][13], and from social media in particular[14][15]. The purpose of opinion mining and sentiment analysis is to predict, for example, customers' preferences for a specific product, which is valuable for economic or marketing research. Opinion mining from social media can also aim to summarize the opinions held about a specific entity (an event, a product, a policy or a person).

Social media is full of opinionated text. The challenges faced by opinion mining in social media are inherited from those faced by natural language processing and text mining in this area. There are many challenges in applying traditional opinion and sentiment analysis techniques to mobile social media like WeChat[14][16]. Tweets do not always follow a conversation thread, which leads to a loss of contextual information and to ambiguity.
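
Short messages detached from their thread give a classifier very little context. One common way to keep some local context is to use word n-grams rather than single words as features. The sketch below is a minimal illustration with a hypothetical function name and example sentence; it uses naive whitespace tokenization, which would not work for unsegmented Chinese text.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams (as tuples) of a whitespace-tokenized text."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


unigrams = word_ngrams("This phone is not good", 1)
bigrams = word_ngrams("This phone is not good", 2)
print(bigrams)
# [('this', 'phone'), ('phone', 'is'), ('is', 'not'), ('not', 'good')]
```

With single words, "good" looks positive on its own; the bigram ("not", "good") keeps the negation attached to it.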
Some works, [17] for example, have tried to reduce this contextual ambiguity using n-grams, but the issue remains a great challenge. Irony and sarcasm in opinions expressed on mobile social media are difficult for a machine to detect. Most opinion mining techniques for social media rely on machine learning. Classifiers such as support vector machines (SVM) and naive Bayes, built with supervised methods, perform well on sentiment polarity detection, but their accuracy drops drastically when they are applied to new domains[18][19][20].

3.5 China internet censorship challenges

China is one of the countries that practice internet control and censorship. Methods of internet control include web content regulation techniques that decide whether some tweets should remain in social media (or on any other website) or should be automatically deleted[21]. Other censorship techniques consist of restricting user behavior, physical network connections, market orders and technological standards[22]. These deletions and restrictions have a negative effect on data integrity and are likely to make data extraction much harder. Whether the data in the WeChat database belongs to Tencent in Shenzhen or to the central government of the People's Republic of China, the country's internet monitoring and control policy means that it may be illegal for Tencent (or any other person or business) to develop an API that makes WeChat's massive data available to the public for mining purposes[23], except under the strict supervision of the government.

4 Conclusion

In this short paper we used the famous Chinese mobile social networking app WeChat as an instance to illustrate a variety of challenges in mobile social media data mining engineering and research. Most of these challenges apply to other mobile social networking sites as well. There are legal challenges, which depend on each country's internet laws and policies. China is a special case, where the legal challenges stem from its internet control and censorship laws. There are also technical challenges, such as the development of new application programming interfaces (APIs). This technical challenge requires learning new development tools, such as the Xcode environment for WeChat. Further challenges relate to the size and the dynamic characteristics of mobile social media graphs, which cause social media structure mining algorithms to work poorly. Other challenges concern natural language processing and opinion mining from social media, where obtaining adequate samples and properly evaluating mining results seem insurmountable. Future research could be dedicated to addressing each of these challenges.

References

[1] Pippal Sanjeev, Batra Lakshay, Krishna Akhila, Gupta Hina, and Arora Kunal. Data mining in social networking sites: A social media mining approach to generate effective business strategies. International Journal of Innovations and Advancement in Computer Science, Volume 3, Issue 2, pages 22-27, April 2014.
[2] Zafarani Reza, Abbasi Mohammad Ali, and Liu Huan. Social media mining: an introduction. Cambridge University Press, 2014.
[3] boyd Danah M. and Ellison Nicole B. Social network sites: definition, history, and scholarship. Journal of Computer-Mediated Communication 13, No. 1, pages 210-230, 2007.
[4] Kaplan Andreas M. and Haenlein Michael. Users of the world, unite! The challenges and opportunities of social media. Business Horizons 53, No. 1, pages 59-68, 2010.
[5] Russell Matthew A. Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites. O'Reilly Media, 2011.
[6] Nagamochi Hiroshi. Algorithms for the minimum partitioning problems in graphs. Electronics and Communications in Japan, Part 3, Vol. 90, No. 10 (translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J86-D-I, No. 2, February 2003, pp. 53-68), pages 63-78, 2007.
[7] Jonyer Istvan, Cook Diane J., and Holder Lawrence B. Graph-based hierarchical conceptual clustering. Journal of Machine Learning Research, pages 19-43, February 2001.
[8] Galluccio Laurent, Michel Olivier, Comon Pierre, and Hero Alfred O. Graph based k-means clustering. Signal Processing, Volume 92, Issue 9, pages 1970-1984, September 2012.
[9] Nascimento Maria C.V. and de Carvalho Andre C.P.L.F. Spectral methods for graph clustering: A survey. European Journal of Operational Research, Volume 211, Issue 2, pages 221-231, June 2011.
[10] Liu Bing. Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers, 2012.
[11] Vinodhini G. and Chandrasekaran RM. Sentiment analysis and opinion mining: a survey. International Journal of Advanced Research in Computer Science and Software Engineering, pages 282-292, June 2012.
[12] Todi Aditi, Agrawal Anahita, Taparia Ankit, Lakhmani Nikhlesh, and Shettar Rajashree. An opinion-tree based flexible opinion mining model. International Journal of Engineering Science and Advanced Technology, Volume 2, Issue 3, pages 550-554, 2012.
[13] Siddiki Ahmad Tasnim and Aljahdali Sultan. Web mining techniques in e-commerce applications. International Journal of Computer Applications, Volume 69, No. 8, pages 39-43, May 2013.
[14] Karamibekr Mostafa and Ghorbani Ali A. A structure for opinion mining in social domains. SocialCom/PASSAT/BigData/EconCom/BioMedCom, pages 264-271, IEEE, 2013.
[15] Maynard Diana, Bontcheva Kalina, and Rout Dominic. Challenges in developing opinion mining tools for social media.
[16] Rahmath P. Haseena. Opinion mining and sentiment analysis: challenges and applications. International Journal of Application or Innovation in Engineering and Management, Volume 3, Issue 5, pages 401-403, May 2014.
[17] Shelke Nilesh M., Deshpande Shriniwas, and Thakre Vilas. Survey of techniques for opinion mining. International Journal of Computer Applications, Volume 57, No. 13, pages 30-35, November 2012.
[18] Khairnar Jayashri and Kinikar Mayura. Machine learning algorithms for opinion mining and sentiment classification. International Journal of Scientific and Research Publication, Volume 3, Issue 6, June 2013.
[19] Singh Pravesh Kumar and Husain Mohd Shahid. Methodological study of opinion mining and sentiment analysis techniques. International Journal on Soft Computing, Vol. 5, No. 1, pages 11-21, February 2014.
[20] Buche Arti, Chandak M.B., and Zadgaonkar Akshay. Opinion mining and analysis: a survey. International Journal on Natural Language Computing, Volume 2, No. 3, pages 39-47, June 2013.
[21] Bamman D., O'Connor B., and Smith N. Censorship and deletion practices in Chinese social media. First Monday Online 17(3), 2012.
[22] Li Xiaoyu and Robbin Alice. How China regulates online content: a policy evolution framework. IADIS International Journal on WWW/Internet, Vol. 11, No. 3, pages 35-45, 2012.
[23] Khanna Rohan, Dhingra Vikram, and Choudhary Kavita. Internet censorship: Freedom vs security. International Journal of Computer Trends and Technology (IJCTT), Volume 4, Issue 8, pages 2695-2698, August 2013.