Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.



Similar documents
Data Mining: Introduction. Lecture Notes for Chapter 1. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Introduction of Information Visualization and Visual Analytics. Chapter 4. Data Mining

Introduction to Artificial Intelligence G51IAI. An Introduction to Data Mining

Foundations of Artificial Intelligence. Introduction to Data Mining

Data Mining: Introduction

Data Mining. Yeow Wei Choong Anne Laurent

Social Media Mining. Data Mining Essentials

Big Data. Introducción. Santiago González

Introduction to Data Mining

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Information Management course

Data mining for prediction

Data Mining and Machine Learning in Bioinformatics

Machine Learning and Data Mining. Fundamentals, robotics, recognition

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

How To Perform An Ensemble Analysis

Numerical Algorithms Group

A Review of Data Mining Techniques

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

DATA MINING - 1DL105, 1Dl111

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

Sentiment analysis on tweets in a financial domain

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Introduction. A. Bellaachia Page: 1

MS1b Statistical Data Mining

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

The Basics of Graphical Models

Introduction to Pattern Recognition

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

Clustering Technique in Data Mining for Text Documents

Data Mining is sometimes referred to as KDD and DM and KDD tend to be used as synonyms

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

An Introduction to Data Mining

Network Big Data: Facing and Tackling the Complexities Xiaolong Jin

not possible or was possible at a high cost for collecting the data.

An Overview of Knowledge Discovery Database and Data mining Techniques

The Data Mining Process

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Big Data Text Mining and Visualization. Anton Heijs

Practical Applications of DATA MINING. Sang C Suh Texas A&M University Commerce JONES & BARTLETT LEARNING

An Overview of Database management System, Data warehousing and Data Mining

Cell Phone based Activity Detection using Markov Logic Network

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

D A T A M I N I N G C L A S S I F I C A T I O N

Data Mining Analytics for Business Intelligence and Decision Support

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

The Scientific Data Mining Process

Statistical Models in Data Mining

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Classification algorithm in Data mining: An Overview

Statistics for BIG data

Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

life science data mining

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

SPATIAL DATA CLASSIFICATION AND DATA MINING

from Larson Text By Susan Miertschin

Protein Protein Interaction Networks

Graph Mining and Social Network Analysis

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

Comparison of K-means and Backpropagation Data Mining Algorithms

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

A STUDY OF DATA MINING ACTIVITIES FOR MARKET RESEARCH

Role of Social Networking in Marketing using Data Mining

Introduction to data mining. Example of remote sensing image analysis

Data Mining Applications in Higher Education

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Machine Learning for Data Science (CS4786) Lecture 1

A Comparative Study on Sentiment Classification and Ranking on Product Reviews

Information Visualization WS 2013/14 11 Visual Analytics

Knowledge Discovery from patents using KMX Text Analytics

Course: Model, Learning, and Inference: Lecture 5

SOCIAL NETWORK DATA ANALYTICS

Big Data: Rethinking Text Visualization

Business Intelligence and Decision Support Systems

Learning outcomes. Knowledge and understanding. Competence and skills

Data Mining in Telecommunication

Learning is a very general term denoting the way in which agents:

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

Data Mining Classification: Decision Trees

Database Marketing, Business Intelligence and Knowledge Discovery

Knowledge-based systems and the need for learning

Text Analytics. A business guide

Data, Measurements, Features

Digging for Gold: Business Usage for Data Mining Kim Foster, CoreTech Consulting Group, Inc., King of Prussia, PA

Introduction to Data Mining

Social Networks and Social Media

Data Mining System, Functionalities and Applications: A Radical Review

IJCSES Vol.7 No.4 October 2013 pp Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS

Exploring Big Data in Social Networks

CHAPTER 1 INTRODUCTION

Spam Detection A Machine Learning Approach

A Survey on Pre-processing and Post-processing Techniques in Data Mining

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Transcription:

Data Mining on Social Networks Dionysios Sotiropoulos Ph.D. 1

Contents What are Social Media? Mathematical Representation of Social Networks Fundamental Data Mining Concepts Data Mining Tasks on Digital Social Networks 2/36

What are Social Media? A social network is defined as a network of interactions or relationships, where the nodes consist of actors, and the edges consist of the relationships or interactions between these actors. Milgram in the sixties (well before the invention of the internet), hypothesized the likelihood that any pair of actors on the planet are separated by at most six degrees of separation. Thisisalso referred to as the small world phenomenon. Any web-site or application which provides a social experience in the form of user-interactions can be considered to be a form of social network. Social networks can be viewed as a structure which enables the dissemination of information. The analysis of the dynamics of such interaction is a challenging problem in the field of social networks. 3/36

Mathematical Representation of Social Networks Social networks employ the typical graph notation where stands for the whole network, stands for the set of all vertices and stands for the set of all edges. Example: where and 4/36

Fundamental Data Mining Concepts I Why Mine Data? Commercial Viewpoint Computers have become cheaper and more powerful Lots of data is being collected and warehoused Competitive Pressure is Strong Provide better, customized services for an edge (e.g. in Customer Relationship Management) 5/36

Fundamental Data Mining Concepts II Why Mine Data? Scientific Viewpoint Data collected and stored at enormous speeds (GB/hour) remote sensors on a satellite telescopes scanning the skies microarrays generating gene expression data scientific simulations generating terabytes of data Traditional techniques infeasible for raw data Data mining may help scientists in classifying and segmenting data in Hypothesis Formation 6/36

Fundamental Data Mining Concepts III Mining Large Data Sets Motivation There is often information hidden in the data that is not readily evident Human analysts may take weeks to discover useful information Much of the data is never analyzed at all 4,000,000 3,500,000 3,000,000 2,500,000 2,000,000 1,500,000 1,000,000 500,000 0 The Data Gap Total new disk (TB) since 1995 Number of analysts 1995 1996 1997 1998 1999 7/36

Fundamental Data Mining Concepts IV What is Data Mining (Many Definitions) Non-trivial extraction of implicit, previously unknown and potentially useful information from data. Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns. 8/36

Fundamental Data Mining Concepts V Origins of Data Mining: Draws ideas from Machine Learning / AI, Pattern Recognition, Statistics, and Database Systems Traditional Techniques may be unsuitable due to: Enormity of data High dimensionality of data Heterogeneous, distributed nature of data Statistics/ AI Data Mining Database systems Machine Learning/ Pattern Recognition 9/36

Fundamental Data Mining Concepts VI Prediction Methods Use some variables to predict unknown or future values of other variables. Description Methods Find human-interpretable patterns that describe the data. Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Sequential Pattern Discovery [Descriptive] Regression [Predictive] Deviation Detection [Predictive] 10/36

Fundamental Data Mining Concepts VII Classification: Definition Given a collection of records (training set) Each record contains a set of attributes, one of the attributes is the class. Find a model for the class attribute as a function of the values of other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it 11/36

10 10 Fundamental Data Mining Concepts VIII Classification: Example Tid Refund Marital Status Taxable Income Cheat Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes No Single 75K? Yes Married 50K? No Married 150K? Yes Divorced 90K? No Single 40K? No Married 80K? Learn Classifier Test Set Model 12/36

Fundamental Data Mining Concepts IX Classification: Example (Direct Marketing) Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cellphone product. Approach: Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don t buy} decision forms the class attribute. Collect various demographic, lifestyle, and company-interaction related information about all such customers. Use this information as input attributes to learn a classifier model 13/36

Fundamental Data Mining Concepts X Clustering: Definition Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that: Data points in one cluster are more similar to one another. Data points in separate clusters are less similar to one another. Similarity Measures: Euclidean Distance / Cosine Similarity if attributes are continuous. Other Problem-specific Measures. 14/50

Fundamental Data Mining Concepts XI Clustering: Example (Market Segmentation) Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. Approach: Collect different attributes of customers based on their geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters. 15/36

Fundamental Data Mining Concepts XII Regression: Definition Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Greatly studied in statistics, neural network fields. Regression Examples: Predicting sales amounts of new product based on advertising expenditure. Predicting wind velocities as a function of temperature, humidity, air pressure, etc. Time series prediction of stock market indices 16/36

Data Mining Tasks on Digital Social Networks: Community Detection I The notion of community denotes groups of people with shared interests or activities. political biases voting patterns consuming habits Community Detection (Definition): is the task of discovering inherent community structures or clusters within online social networks. Structural Definition: Communities as cliques 17/36

Data Mining Tasks on Digital Social Networks: Community Detection II Core Methods : Quality Functions The Kernighan-Lin(KL) algorithm Agglomerative/Divisive Algorithms Spectral Algorithms Multi-level Graph Partitioning Markov Clustering Challenges : Community Discovery in Dynamic Networks Community Discovery in Heterogeneous Networks Coupling Content and Relationship Information for Community Discovery 18/36

Data Mining Tasks on Digital Social Networks: Evolution Analysis Informally, evolution refers to a change that manifests itself across the time axis. Communities can be perceived as clusters built at each time-point: analysis of community evolution involves tracing the same community/cluster at consecutive time-points and identifying changes. Communities can also be perceived as smoothly evolving constellations: community monitoring then involves learning models that adapt smoothly from one time-point to the next. 19/36

Data Mining Tasks on Digital Social Networks: Evolution Analysis II Challenges : Incremental Mining for Community Tracing Tracing Smoothly Evolving Communities Verifying that the communities discovered by a learning algorithm are indeed the real ones Verifying that the community evolution patterns detected are realistic, i.e. they conform to prior knowledge or can be verified by inspection of the underlying data 20/36

Data Mining Tasks on Digital Social Networks: Social Influence Analysis I Social influence is an intuitive and well-accepted phenomenon in social networks[d. Easley and J. Kleinberg]. The strength of social influence depends on many factors such as: the strength of relationships between people in the networks the network distance between users temporal effects characteristics of networks and individuals in the network. A central problem for social influence is to understand the interplay between similarity and social ties [D. Easley and J. Kleinberg] 21/50

Data Mining Tasks on Digital Social Networks: Social Influence Analysis II Homophily is one of the most fundamental characteristics of social networks. This suggests that an actor in the social network tends to be similar to their connected neighbors or friends. The phenomenon of homophily can originate from many different mechanisms: Social influence: This indicates that people tend to follow the behaviors of their friends. Selection: This indicates that people tend to create relationships with other people who are already similar to them; Confounding variables: Other unknown variables exist, which may cause friends to behave similarly with one another. Challenges : Influence Maximization 22/36

Data Mining Tasks on Digital Social Networks: Link Inference / Prediction I Examine : Node Neighborhood based Features. In simple words, it means that in social networks if vertex x is connected to vertex z and vertex y is connected to vertex z, then there is a heightened probability that vertex x will also be connected to vertex y. Shortest Path Distance. The fact that the friends of a friend can become a friend suggests that the path distance between two nodes in a social network can influence the formation of a link between them. The shorter the distance, the higher the chance that it could happen. 23/36

Data Mining Tasks on Digital Social Networks: Link Inference / Prediction II Examine : Hitting Time. The concept of hitting time comes from random walks on a graph. For two vertices, x and y in a graph, the hitting time,, defines the expected number of steps required for a random walk starting at x to reach y. 24/36

Data Mining Tasks on Digital Social Networks: Machine Learning The fundamental assumption of independence is not valid when learning within the context of Social Networks. Social network data sets are often called relational since the relations among entities are central, e.g. : friendship or following ties between members of the social network. interactions such as wall posts, private messages, re-tweeting or tagging a photo. Traditional machine learning algorithms are not directly applicable on relational data sets. 25/36

Data Mining Tasks on Digital Social Networks: Node Classification When dealing with large graphs, such as those arising in the context of social networks, a subset of nodes may be labeled. Existing labels can indicate demographic values, interest or other characteristics of the nodes (users). A core problem is to use this information in order to extend the labeling so that all nodes are assigned with a label. 26/36

Data Mining Tasks on Digital Social Networks: Text Mining I Text mining: seeks to extract useful information from data sources through the identification and exploration of interesting patterns. In the case of text mining, however, the data sources are document collections, and interesting patterns are found not among formalized database records but in the unstructured textual data in the documents in these collections. 27/36

Data Mining Tasks on Digital Social Networks: Text Mining II Document Features: Text mining preprocessing operations attempt to transform a natural language document from an irregular and implicitly structured representation into an explicitly structure representation. An essential task for most text mining systems is the identification of a simplified subset of document features that can be used to represent a particular document as a whole. This set of features constitutes the representational model of a document. 28/36

Data Mining Tasks on Digital Social Networks: Natural Language Processing Natural Language Processing is an essential preprocessing step for both Topic Modeling and Sentiment Analysis. NLP Preprocessing Operations: Tokenization Stop-word Removal Stemming Lemmatization Translation Special Characters Replacement Repeating Characters Removal Spelling Correction Synonyms Replacement Replacing Negations with Antonyms 29/36

Data Mining Tasks on Digital Social Networks: Topic Modeling I Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives. Discover the hidden themes that pervade the collection. Annotate the documents according to those themes. Use annotations to organize, summarize, and search the texts. Simple Intuition: Documents exhibit multiple topics. 30/36

Data Mining Tasks on Digital Social Networks: Topic Modeling II Fundamental Assumptions: Each topic is a distribution over words. Each document is a mixture of corpuswide topics. Each word is drawn from one of those topics. In reality we observe the documents. The other structure are hidden variables. 31/36

Data Mining Tasks on Digital Social Networks: Topic Modeling III Probabilistic Topic Modeling: the primary objective is to infer the hidden variables, that is compute their distribution conditioned on the documents.,, Nodes are random variables. Shaded nodes are observed. Plates indicate replicated variables. 32/36

Data Mining Tasks on Digital Social Networks: Sentiment Analysis I Sentiment Analysis: the use of natural language processing (NLP) and computational techniques to automate the extraction or classification of sentiment from typically unstructured text. Motivation: Consumer information Product reviews. Marketing Consumer attitudes Trends Politics Politicians want to know voters views Voters want to know politicians stances and who else supports them Social Find like-minded individuals or communities 33/36

Data Mining Tasks on Digital Social Networks: Sentiment Analysis II Related Problems: Which features to use? Words (unigrams) Phrases/n-grams Sentences How to interpret features for sentiment detection? Bag of words (IR) Annotated lexicons (WordNet, SentiWordNet) Syntactic patterns Paragraph structure 34/36

Data Mining Tasks on Digital Social Networks: Sentiment Analysis III Challenges: Harder than topical classification, with which bag of words features perform well Must consider other features due to Subtlety of sentiment expression Irony expression of sentiment using neutral words Domain/context dependence words/phrases can mean different things in different contexts and domains Effect of syntax on semantics 35/36

Data Mining Tasks on Digital Social Networks: Sentiment Analysis IV Approaches: Supervised Methods Naïve Bayes Maximum Entropy Classifier SVM Markov Blanket Classifier Accounts for conditional feature dependencies Assume pairwise independent features Allowed reduction of discriminating features from thousands of words to about 20 (movie review domain) Unsupervised methods Use lexicons 36/36