Contemporary Techniques for Data Mining Social Media

Stephen Cutting

1 Introduction

Social media websites such as Facebook, Twitter and Google+ allow millions of users to communicate with one another in a range of dynamic ways [Zafarani et al., 2014, p.18] and also offer new ways for data about these users to be collected, processed and analysed [Adedoyin-Olowe et al., 2013]. The unique nature of social networks means that the data they produce presents distinct challenges for those seeking to use it for data mining [Zafarani et al., 2014, p.15]. This report analyses the challenges that data gathered from social media websites poses to the knowledge discovery in databases (KDD) process, of which data mining forms a part. The report also examines the main stages of the KDD process in the context of mining social media data.

2 Challenges Presented by Social Media Data

Social media can generate a wide range of different types of data, including user profile information, text, video and images. For many tasks related to data mining social media to be successful, these different forms of data have to be processed and analysed effectively [Witten and Frank, 2005]. This can create unique challenges [Adedoyin-Olowe et al., 2013, p.1]. Different types of often complex relationships and interactions between users (social atoms), communities (social molecules) and content must be analysed, requiring the development of new methodologies and algorithms [Zafarani et al., 2014, pp.16-22].

Because well known social networks can have tens of millions of users [Rossi, 2010], they can produce vast amounts of data (see Figure 1). Those trying to mine data gathered from social media often face challenges caused by the sheer amount of generated data that needs to be processed and analysed [Adedoyin-Olowe et al., 2013, p.1], [Zafarani et al., 2014, p.15].
The challenges of mining data from social media include data processing and data storage. Processing large volumes of data can be time consuming and very memory intensive [Yan et al., 2015]. Some data processing techniques that can be used on smaller data sets, including data replication and data migration, are not practical when dealing with very large amounts of data [Lawrence et al., 2013, p.70]. Physically storing enormous amounts of data may be prohibitively time consuming or expensive for some individuals and organisations.

Although social media generates an enormous amount of data, there can be a lack of information about specific entities; this is known as the big data paradox [Zafarani et al., 2014, p.17]. This issue could cause problems if data is needed on an individual user.

The unstructured, noisy nature of data produced by social media can make data mining more difficult [Zafarani et al., 2014, p.15], [Adedoyin-Olowe et al., 2013, p.1]. One reason that data collected from such sources is frequently noisy and unstructured is that it is generated by a large number of individual users [Zafarani et al., 2014, p.15]. Noise in data can cause a range of issues in the data mining process. Erroneous data can have a negative impact on data mining algorithms simply because the data that they are processing contains a large number of errors [Wu and Zhu, 2008, p.917]. Misclassification can often occur, even in data that is being utilised while training the classifier [Witten and Frank, 2005, p.5]. Errors, or noise, in data can also make it more difficult for the algorithms used in machine learning to identify an optimal concept description for a specific set of data, since an optimal description may be discarded because it contains attributes that are identified as errors [Witten and Frank, 2005, pp.26-29].

Figure 1: Diagram of estimated data generated on social network sites every minute. Source: [Zafarani et al., 2014, p.36]

3 Harvesting Social Media Data

Application programming interfaces (APIs) are one of the most common ways that data is gathered from social media [Zafarani et al., 2014, p.17], [Abdulrahman et al., 2013]. APIs enable software applications to communicate with one another securely and accurately over the Internet [Abdulrahman et al., 2013, p.89] and often take the form of libraries. Facebook, Twitter and Google+ all have APIs that developers can utilise when carrying out tasks such as data mining [Russell, 2013, p.48].

Facebook's Graph API is a good example of an API that is used in the data mining of social media. It is the primary way that people request and submit data to Facebook's social graph [Facebook, 2014b]. The Graph API has an SQL-like language called Facebook Query Language (FQL), which can be used to develop queries for the API, although FQL is being phased out [Facebook, 2014a].

Social graphs are a representation of the information held on social networks [Sharad and Danezis, 2014]. The two main tables that make up Facebook's social graph are called node and link [Facebook, 2014b]. The node table contains information about nodes, which represent objects such as users, photos and videos [Qu and Dessloch, 2014, p.139]. The link table contains information about the links between different objects, for example whether two users are friends or whether a user likes a photograph [Qu and Dessloch, 2014, p.139].

4 Storing Social Media Data

Some KDD techniques require data to be stored in some way [Adedoyin-Olowe et al., 2013]. This section explores how social media data is stored in preparation for such analysis.

As has been discussed, storing the massive amounts of data that can be produced by social media websites for use in KDD processes can present challenges. As those running such websites require more storage space, scalability becomes an important consideration [Tran, 2013, p.1]. There are two primary methods for addressing this scalability issue: horizontal scaling and vertical scaling [Tran, 2013, p.1]. Vertical scaling is the addition of more hardware resources to existing servers [Pokorny, 2013]. Horizontal scaling involves spreading the workload across additional commodity servers [Pokorny, 2013]. Horizontal scaling is utilised by most social media websites today as it avoids bottleneck issues and is more cost effective than vertical scaling [Tran, 2013, p.1].

Although it is well known that duplicate data can cause issues in distributed database systems [Witten and Frank, 2005, p.54], it can be useful in the context of social media websites. This is because replicated data can be used to minimise the disruption caused by a system failure [Tran, 2013, p.1].
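The combination of horizontal scaling and replication described above can be sketched in a few lines. This is only an illustration under stated assumptions, not a description of how any particular social media website stores its data: the server names, the replica count and the helper functions `servers_for` and `readable_from` are all hypothetical. Each user's data is assigned to a primary server by hashing the user identifier, and a copy is kept on the next server in the list so that a single failure does not make the data unavailable.

```python
import hashlib

# Hypothetical pool of commodity servers; horizontal scaling adds entries here.
SERVERS = ["server-0", "server-1", "server-2", "server-3"]
REPLICAS = 2  # assumed number of copies kept of each user's data


def servers_for(user_id, servers=SERVERS, replicas=REPLICAS):
    """Return the servers that hold a user's data: a primary plus replicas."""
    # Hash the identifier so that users spread evenly across the servers.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    primary = int(digest, 16) % len(servers)
    # Place each further copy on the next server in the list, wrapping around.
    return [servers[(primary + i) % len(servers)] for i in range(replicas)]


def readable_from(user_id, failed):
    """Return the servers still able to serve a user's data after failures."""
    return [s for s in servers_for(user_id) if s not in failed]
```

With two copies of every record, `readable_from` still returns one server when either holder fails. Note that this simple modulo placement moves many users to different servers whenever the pool changes size, so it should be read as a teaching device rather than a production scheme.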
It has also been shown that it can be more efficient to store data related to connected individuals, some of which may be duplicated, on servers that are more closely connected [Tran, 2012, p.2]. A storage scheme that adheres to this principle is described as socially aware [Tran, 2012].

5 Data Cleansing

Data cleansing is an important part of the KDD process in general [Witten and Frank, 2005, pp.48-54], but it is especially important and potentially more difficult when the data comes from a social media website, as is discussed in section 2 of this report. Commonly used KDD data cleansing techniques are utilised in this process; they aim to remove noise and deal with outliers, missing values and duplicate data [Zafarani et al., 2014, p.143].

As has been discussed, data from social media websites can be very noisy. Filtering algorithms are often effective at reducing the amount of noise in data in the data cleansing process [Zafarani et al., 2014, p.141]. However, the reduction of noise in data can lead to data being lost [Wu and Zhu, 2008]. Although reducing class noise in a training dataset can frequently help a learner be more accurate [Zhu and Wu, 2004], simply deleting a noisy instance that contains erroneous attribute values from a dataset can cause issues [Wu and Zhu, 2008]. This is because the other, correct attribute values of the instance may still aid the learning process [Zafarani et al., 2014, p.141]. There is no evidence to suggest that data cleansing can improve data mining results on data sets that contain erroneous or missing data [Zafarani et al., 2014, p.141].

6 Pre-processing Social Media Data

In social media, a great deal of information is represented as networks, which can be sampled by selecting a subset of their edges and nodes using various sampling techniques [Adedoyin-Olowe et al., 2013]. Networks can additionally be sampled by starting from a small set of nodes and sampling the edges and nodes that are connected or closely related to them, together with their connected components [Adedoyin-Olowe et al., 2013].

Aggregation is one technique that is used in the pre-processing of data in the KDD process; it involves the merging of multiple objects into a single object [Tan et al., 2006, pp.45-47], [Witten and Frank, 2005, p.49]. Aggregation is used in pre-processing data gathered from social media websites [Zafarani et al., 2014, p.142]. Combining multiple features can save storage space and reduce data variance, which can make the data more resistant to noise and distortion [Zafarani et al., 2014, p.142].

Discretisation is the process of converting a continuous attribute into a categorical attribute, which may be necessary if a data mining algorithm needs data to be in the form of categorical attributes [Tan et al., 2006, p.57]. Discretisation is also useful as it allows data from social media that has been placed in a group to be compared with other information and analysed as a whole [Zafarani et al., 2014, p.142].

The selection of subsets of features helps reduce the number of redundant and irrelevant features in a data set [Tan et al., 2006, p.52] and it is used in the pre-processing of data from social media [Zafarani et al., 2014, p.142]. This is important because irrelevant or redundant features can reduce both the quality of the clusters that are identified and the accuracy of classification [Tan et al., 2006, p.52].

Sampling is the selection and processing of a small, random section of the data, which avoids processing the entire data set [Riondato, 2014]. This is especially important when mining information gathered from social media, as processing an entire data set can be prohibitively computationally expensive [Zafarani et al., 2014, p.143].
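Two of the pre-processing steps described in this section, discretisation and sampling, can be illustrated with a short sketch. The data, the bin boundaries, the category labels and the function names `discretise` and `sample_records` are assumptions made purely for this example and do not come from the cited sources. The sketch draws a simple random sample from a list of continuous values and then converts each sampled value into one of three equal-width categories.

```python
import random


def discretise(value, lower, upper, labels):
    """Map a continuous value to one of len(labels) equal-width bins."""
    width = (upper - lower) / len(labels)
    # Clamp so that values at or beyond the upper bound fall into the last bin.
    index = min(int((value - lower) / width), len(labels) - 1)
    # Values below the lower bound are placed in the first bin.
    return labels[max(index, 0)]


def sample_records(records, fraction, seed=0):
    """Return a simple random sample of roughly `fraction` of the records."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    size = max(1, round(len(records) * fraction))
    return rng.sample(records, size)


# Hypothetical data: average number of posts per day for each of 1,000 users.
posts_per_day = [random.Random(i).uniform(0, 30) for i in range(1000)]

# Work on a 5% sample rather than the whole data set.
sample = sample_records(posts_per_day, fraction=0.05)

# Convert the continuous attribute into a categorical one.
activity = [discretise(v, 0, 30, ["low", "medium", "high"]) for v in sample]
```

Equal-width binning is only one possible discretisation scheme; it is used here because it is the simplest to follow, not because the sources recommend it for social media data.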
7 Mining Social Media Data

Data mining is a stage in the KDD process in which pre-processed data is analysed to discover relationships and patterns [Zafarani et al., 2014, p.135]. Various techniques are utilised in the mining of data from social media websites and some of the most relevant of these are discussed in this section. The nature of the data that is being analysed often determines the best data mining techniques to deploy when analysing it [Zafarani et al., 2014, pp ].

Figure 2: Diagram of KDD process. Source: [Zafarani et al., 2014, p.36]

There are several different categories of data mining algorithms. Supervised learning and unsupervised learning are two important categories in the mining of data from social media websites. In supervised learning, the class attribute exists, and the objective is to predict the class attribute value [Adedoyin-Olowe et al., 2013]. In unsupervised learning, the dataset has no class attribute, and the objective is to identify similar instances in the data and place them in groups [Adedoyin-Olowe et al., 2013]. The grouping of instances that share similarities allows the identification of important patterns in a dataset [Cabanes et al., 2010].

Despite the use of filtering algorithms, like those discussed in section 5 of this report, any data mining effort is likely to suffer if there are many errors in the data [Wu and Zhu, 2008, p.918]. This means that it is important that data mining algorithms are error tolerant and are not severely affected by a relatively small amount of noise in the data being analysed [Wu and Zhu, 2008, p.918].

Graph theory, the study of graphs, is very important in mining information from social media. This is because it allows the identification of important features of the data and is particularly effective at dealing with large datasets [Adedoyin-Olowe et al., 2013]. Community detection is one graph theory technique that is used to analyse social media data. This entails the utilisation of methodologies including vertex clustering, a form of hierarchical clustering, to try to identify the social molecules, or communities, that social atoms, or users, belong to [Adedoyin-Olowe et al., 2013].

8 Conclusions

As social media websites continue to grow, the amount of data that they produce will increase. Fortunately, the combination of the ever increasing power and storage capacity of computer systems and the development of algorithms and methodologies like those discussed in this report means that this data may be usable in KDD processes. The data mining stage of these processes may well be able to reveal a great deal about the individuals that make up these vast social networks and the dynamic connections between them.
References

[Abdulrahman et al., 2013] Abdulrahman, R., Neagu, D., Holton, D. R. W., Ridley, M. J., and Lan, Y. (2013). Data Extraction from Online Social Networks Using Application Programming Interface in a Multi Agent System Approach. T. Computational Collective Intelligence, 11:

[Adedoyin-Olowe et al., 2013] Adedoyin-Olowe, M., Gaber, M. M., and Stahl, F. (2013). Survey of Data Mining Techniques for Social Media Analysis. CoRR, abs/

[Cabanes et al., 2010] Cabanes, G., Bennani, Y., and Fresneau, D. (2010). Mining RFID behavior data using unsupervised learning. IJAL, 1(1):

[Facebook, 2014a] Facebook (2014a). Facebook Query Language (FQL) Overview. https://developers.facebook.com/docs/technical-guides/fql/. [Online; Accessed February 2015].

[Facebook, 2014b] Facebook (2014b). Quickstart for Graph API. facebook.com/docs/graph-api/. [Online; Accessed February 2015].

[Lawrence et al., 2013] Lawrence, B. N., Bennett, V. L., Churchill, J., Juckes, M., Kershaw, P., Pascoe, S., Pepler, S., Pritchard, M., and Stephens, A. (2013). Storing and manipulating environmental big data with JASMIN. In Hu, X., Lin, T. Y., Raghavan, V., Wah, B. W., Baeza-Yates, R. A., Fox, G., Shahabi, C., Smith, M., Yang, Q., Ghani, R., Fan, W., Lempel, R., and Nambiar, R., editors, Proceedings of the 2013 IEEE International Conference on Big Data, 6-9 October 2013, Santa Clara, CA, USA, pages IEEE.

[Pokorny, 2013] Pokorny, J. (2013). Nosql databases: a step to database scalability in web environment. International Journal of Web Information Systems, 9(1):

[Qu and Dessloch, 2014] Qu, W. and Dessloch, S. (2014). A Demand-Driven Bulk Loading Scheme for Large-Scale Social Graphs. In Manolopoulos, Y., Trajcevski, G., and Kon-Popovska, M., editors, Advances in Databases and Information Systems - 18th East European Conference, ADBIS 2014, Ohrid, Macedonia, September 7-10, Proceedings, volume 8716 of Lecture Notes in Computer Science, pages Springer.

[Riondato, 2014] Riondato, M. (2014). Sampling-based data mining algorithms: Modern techniques and case studies. In Calders, T., Esposito, F., Hüllermeier, E., and Meo, R., editors, Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014, Nancy, France, September 15-19, Proceedings, Part III, volume 8726 of Lecture Notes in Computer Science, pages Springer.

[Rossi, 2010] Rossi, L. (2010). Playing your network: gaming in social network sites. Available at SSRN

[Russell, 2013] Russell, M. (2013). Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More. O'Reilly Media.

[Sharad and Danezis, 2014] Sharad, K. and Danezis, G. (2014). An Automated Social Graph De-anonymization Technique. CoRR, abs/

[Tan et al., 2006] Tan, P.-N., Steinbach, M., Kumar, V., et al. (2006). Introduction to data mining, volume 1. Pearson Addison Wesley Boston.

[Tran, 2012] Tran, A. (2012). Data Storage for Social Networks: A Socially Aware Approach. SpringerBriefs in Optimization. Springer.

[Tran, 2013] Tran, D. (2013). Data Mining Social Networks: A Socially Aware Approach. Springer.

[Witten and Frank, 2005] Witten, I. H. and Frank, E. (2005). Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.

[Wu and Zhu, 2008] Wu, X. and Zhu, X. (2008). Mining With Noise Knowledge: Error-Aware Data Mining. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 38(4):

[Yan et al., 2015] Yan, D., Yin, X., Lian, C., Zhong, X., Zhou, X., and Wu, G. (2015). Using Memory in the Right Way to Accelerate Big Data Processing. J. Comput. Sci. Technol., 30(1):

[Zafarani et al., 2014] Zafarani, R., Abbasi, M. A., and Liu, H. (2014). Social Media Mining: An Introduction. Cambridge University Press.

[Zhu and Wu, 2004] Zhu, X. and Wu, X. (2004). Class noise vs. attribute noise: A quantitative study. Artificial Intelligence Review, 22(3):