Challenges and Opportunities in Data Mining: Big Data, Predictive User Modeling, and Personalization
Bamshad Mobasher
School of Computing, DePaul University
April 20, 2012
Google Trends: Data Mining vs. Analytics
The Big Question?
- Will data mining remain relevant? If so, how?
- Quick survey: Do you think the amount of data available in the digital world
  - will decrease in the future?
  - will become less complex?

"Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?"
-- T.S. Eliot, The Rock
How much data?
- Google: ~20-30 PB a day
- Wayback Machine: ~4 PB + 100-200 TB/month
- Facebook: ~3 PB of user data + 25 TB/day
- eBay: ~7 PB of user data + 50 TB/day
- CERN's Large Hadron Collider generates 15 PB a year
- In 2010, enterprises stored 7 exabytes = 7,000,000,000 GB

"640K ought to be enough for anybody."
The Data Tsunami
Source: McKinsey Global Institute report, "Big Data: the next frontier for innovation, competition and productivity"
Big Data Value
Source: McKinsey Global Institute report, "Big Data: the next frontier for innovation, competition and productivity"
What's Seen the Most Growth in 2008-2011
Types of Data
- Location / Geo / Mobile Data
- Music / Audio
- Social Media / Social Networks
- Time Series
- Images / Video
- User Profile data
- Text feeds / Micro-blog data
Types of Activities/Areas
- Search / Web content mining
- Text mining / opinion analysis
- Personalization / recommendation
- Social network / Social media analysis
- Topic modeling / micro-blog analysis
- Health informatics
Much of this growth is driven by end-user mobile or Web-based applications
- Users are inundated with huge volumes of complex information
- Need for more personalized, intelligent applications
Personalization
The Problem
- Dynamically serve customized content (pages, products, recommendations, etc.) to users based on their profiles, preferences, or expected interests
Why do we need it?
- Information spaces are becoming much more complex for users to navigate (huge online repositories, social networks, mobile applications, blogs, ...)
- For businesses: need to grow customer loyalty / increase sales
- Industry research: successful online retailers are generating as much as 35% of their business from recommendations
Data Mining and Personalization
- Killer app for data mining?
- Tangible successes in both research and industrial applications
  - recommender systems
  - personalized Web agents
  - user-adaptive systems
  - Web marketing and eCRM
  - personalized search
- Sophisticated modeling approaches based on both predictive and unsupervised DM techniques
Personalization: Common Approaches
- Collaborative Filtering
  - Give recommendations to a user based on preferences of similar users
- Content-Based Filtering
  - Give recommendations to a user based on items with similar content in the user's profile
- Rule-Based (Knowledge-Based) Filtering
  - Provide recommendations to users based on predefined (or learned) rules
  - age(x, 25-35) and income(x, 70-100K) and children(x, >=3) => recommend(x, Minivan)
- Combined or Hybrid Approaches
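The rule-based example on this slide can be sketched as a predicate paired with a recommended item. This is only an illustration: the `User` fields, the dollar units for income, and the `RULES` table are assumptions, not part of the original talk.

```python
from dataclasses import dataclass

@dataclass
class User:
    age: int
    income: int    # annual income in dollars (assumed unit)
    children: int

def minivan_rule(user: User) -> bool:
    """The slide's rule: age 25-35, income 70-100K, three or more children."""
    return (25 <= user.age <= 35
            and 70_000 <= user.income <= 100_000
            and user.children >= 3)

# Each entry pairs a predicate with the item it recommends (hypothetical table).
RULES = [(minivan_rule, "Minivan")]

def recommend(user: User) -> list:
    """Return every item whose rule fires for this user."""
    return [item for rule, item in RULES if rule(user)]

print(recommend(User(age=30, income=85_000, children=3)))  # ['Minivan']
print(recommend(User(age=22, income=40_000, children=0)))  # []
```

Learned rules would populate the same table from data instead of being written by hand.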
The Recommendation Task
- Basic formulation as a prediction problem
  - Given a profile P_u for a user u, and a target item i_t, predict the preference score of user u on item i_t
- Typically, the profile P_u contains preference scores by u on some other items, {i_1, ..., i_k}, different from i_t
  - Preference scores on i_1, ..., i_k may have been obtained explicitly (e.g., movie ratings) or implicitly (e.g., time spent on a product page or a news article)
The Recommendation Task
- Content-Based Recommendation
  - Predictions for unseen (target) items are computed based on their similarity (in terms of content) to items in the user profile
- Collaborative Recommendation
  - Predictions for unseen (target) items are computed based on other users with similar interest scores on items in user u's profile
    - i.e., users with similar tastes (aka "nearest neighbors")
    - requires computing correlations between user u and other users according to interest scores or ratings
    - k-nearest-neighbor (kNN) strategy
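A minimal sketch of the collaborative (kNN) strategy, assuming Pearson correlation as the similarity measure and a mean-centered weighted average for the prediction. The slide only says "correlations", so both choices, along with the toy `ratings` data and user names, are assumptions.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: user -> {item: score}
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 5, "B": 3, "C": 4, "D": 5},
    "carol": {"A": 1, "B": 5, "C": 2, "D": 1},
    "dave":  {"A": 4, "B": 2, "C": 5, "D": 4},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def pearson(u, v):
    """Pearson correlation over the items both users rated."""
    common = [i for i in ratings[u] if i in ratings[v]]
    if len(common) < 2:
        return 0.0
    mu = mean(ratings[u][i] for i in common)
    mv = mean(ratings[v][i] for i in common)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in common)
    den = (sqrt(sum((ratings[u][i] - mu) ** 2 for i in common))
           * sqrt(sum((ratings[v][i] - mv) ** 2 for i in common)))
    return num / den if den else 0.0

def predict(u, target, k=2):
    """Predict u's score on target from the k most similar users who rated it."""
    candidates = [(pearson(u, v), v) for v in ratings
                  if v != u and target in ratings[v]]
    # Keep the top-k neighbors, dropping any with non-positive correlation.
    neighbors = [(s, v) for s, v in sorted(candidates, reverse=True)[:k] if s > 0]
    if not neighbors:
        return None
    u_mean = mean(ratings[u].values())
    weighted = sum(s * (ratings[v][target] - mean(ratings[v].values()))
                   for s, v in neighbors)
    return u_mean + weighted / sum(s for s, _ in neighbors)

print(predict("alice", "D"))
```

Here P_u is `ratings["alice"]` and the target item i_t is `"D"`; carol is negatively correlated with alice and so does not contribute.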
Content-Based Recommender Systems
Content-Based Recommenders: Personalized Search
- How can the search engine determine the user's intent?
- Query: "Madonna and Child"
- Need to learn the user profile:
  - User is an art historian?
  - User is a pop music fan?
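One way to picture profile-based disambiguation is to re-rank results by their cosine similarity to a term-weight profile. The sketch below assumes that approach; the profiles, result titles, and term lists are all made up for illustration.

```python
from math import sqrt
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(x * x for x in a.values()))
    nb = sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical user profiles learned from past behavior (term -> weight).
art_historian = Counter({"renaissance": 3, "painting": 4, "museum": 2})
pop_fan = Counter({"album": 4, "concert": 3, "pop": 5})

# Hypothetical results for the query "Madonna and Child" (title -> term counts).
results = {
    "Madonna and Child (painting)": Counter("renaissance painting museum".split()),
    "Madonna (singer) tour dates": Counter("pop concert tour album".split()),
}

def rerank(profile):
    """Order results by content similarity to the user's profile."""
    return sorted(results, key=lambda r: cosine(profile, results[r]), reverse=True)

print(rerank(art_historian)[0])
print(rerank(pop_fan)[0])
```

The same query yields a different top result for each profile, which is the point of the slide.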
Content-Based Recommenders :: More Examples
- Music recommendations
- Playlist generation
- Example: Pandora
Collaborative Recommender Systems
Personalization Based on User Behavior Data: Data Mining Approach
Data Preparation / Modeling Phase (typically an offline process)
[Diagram; recoverable components listed below]
- Inputs: implicit or explicit user preference data (clickthroughs, ratings, purchases, reviews); content & structure data; domain knowledge
- Data Preprocessing: data cleaning, data integration, data transformation, event model generation, sessionization
- User Transaction / Preference Database
- Pattern Discovery Phase (data mining): user segmentation, item clustering / similarity, user/item classification, correlation analysis, association rule mining, sequential pattern mining
- Patterns
- Pattern Analysis: pattern filtering, aggregation, characterization
- Aggregate User Models
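As an illustration of the sessionization step in preprocessing, the sketch below splits each user's clickstream into sessions whenever the gap between consecutive hits exceeds a timeout. The 30-minute threshold and the `(user, timestamp, page)` event layout are assumptions; the slides do not specify either.

```python
from collections import defaultdict

TIMEOUT = 30 * 60  # seconds; an assumed inactivity threshold

def sessionize(events, timeout=TIMEOUT):
    """Group (user, timestamp_seconds, page) events into per-user sessions."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user].append((ts, page))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                # Gap too long: close the current session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions

# Hypothetical clickstream.
events = [
    ("u1", 0, "/home"), ("u1", 60, "/a"), ("u1", 4000, "/b"),
    ("u2", 10, "/home"),
]
print(sessionize(events))
```

The resulting sessions are the kind of records a transaction / preference database would hold before pattern discovery.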
Personalization Based on User Behavior Data: Data Mining Approach
Online Process
[Diagram; recoverable components listed below]
- Recommendation Engine
- Aggregate User Models
- <user, item1, item2, ...>
- Stored User Profile
- Integrated User Profile
- Recommendations, Predictions
- Domain Knowledge
- Active Session
- Web Server
- Client Application
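A rough sketch of how an online recommendation engine might match an active session against an aggregate model built offline. Item co-occurrence counts stand in for the aggregate user models here; that substitution, the function names, and the toy sessions are all assumptions.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(sessions):
    """Offline step: count how often two items appear in the same session."""
    co = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1
    return co

def recommend(active_session, co, n=2):
    """Online step: score unseen items by co-occurrence with the active session."""
    seen = set(active_session)
    scores = Counter()
    for (a, b), count in co.items():
        if a in seen and b not in seen:
            scores[b] += count
    return [item for item, _ in scores.most_common(n)]

# Hypothetical historical sessions.
history = [
    ["tent", "stove", "lantern"],
    ["tent", "stove"],
    ["tent", "boots"],
]
model = build_cooccurrence(history)
print(recommend(["tent"], model, n=1))  # ['stove']
```

Splitting the work this way keeps the expensive counting offline and leaves only a cheap lookup for the request path.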
New Challenges
- Context-Awareness
  - Can systems understand the user's context, situation, current intentions?
  - Need to understand the task being performed; the user's environment, domain knowledge/characteristics; short-term and long-term preferences
- Integrating Domain Knowledge
  - Most current modeling approaches focus on the discovery of shallow patterns
  - DM + Domain Knowledge (DM + AI): intelligent apps that can reason about / explain patterns
New Challenges
- Security / Trust / Reputation
  - Many user-adaptive systems are vulnerable to malicious manipulation (e.g., "shilling")
  - Need more robust algorithms and ways to detect malicious profiles
  - In social systems the notion of reputation becomes critical
- Serendipity
  - The most predictive models are not necessarily the best
  - Need the ability to surprise or provide novelty
- Big Data Challenges
  - Questions of scale require new frameworks and algorithms
  - Wide variation in user behaviors requires more sophisticated models (e.g., matrix factorization, hybrid / ensemble models)
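Matrix factorization is named on this slide without detail. The following is a minimal sketch of one common variant, regularized factorization trained by stochastic gradient descent; the toy ratings, the number of factors, and the hyperparameters are assumptions chosen only to make the example run.

```python
import random
from math import sqrt

# Hypothetical (user, item, rating) triples on a 1-5 scale.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
           (2, 1, 2), (2, 2, 5), (3, 0, 1), (3, 2, 4)]
n_users, n_items, k = 4, 3, 2   # k latent factors (assumed)

rng = random.Random(0)
P = [[rng.uniform(0, 0.5) for _ in range(k)] for _ in range(n_users)]  # user factors
Q = [[rng.uniform(0, 0.5) for _ in range(k)] for _ in range(n_items)]  # item factors

def predict(u, i):
    """Predicted rating is the dot product of user and item factor vectors."""
    return sum(P[u][f] * Q[i][f] for f in range(k))

def rmse():
    return sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))

def train(epochs=500, lr=0.02, reg=0.02):
    """SGD on squared error with L2 regularization over observed ratings only."""
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)

before = rmse()
train()
after = rmse()
print(f"training RMSE: {before:.3f} -> {after:.3f}")
```

Only observed ratings drive the updates, which is what lets this family of models cope with sparse data; at real scale the loops would be replaced by a vectorized or distributed implementation.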
Challenges :: Problems of Scale
New Opportunities :: Social Annotation Systems
Amazon Example: Tags Describe the Resource
Tags can describe
- The resource (genre, actors, etc.)
- Organizational (toread)
- Subjective (awesome)
- Ownership (abc)
- etc.
Tag Recommendation
Example: Tags Describe the User
- These systems are collaborative.
- Recommendation / analytics based on the "wisdom of crowds."
[Screenshot: Rai Aren's profile, co-author, "Secret of the Sands"]
New Opportunities :: Social Recommendation
- A form of collaborative filtering using social network data
- User profiles are represented as sets of links to other nodes (users or items) in the network
- Prediction problem: infer a currently non-existent link in the network
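The link-prediction framing can be sketched with profiles stored as link sets and candidate links scored by neighborhood overlap. Jaccard similarity is an assumed choice here (the slide names no measure), and the tiny network is hypothetical.

```python
# Hypothetical network: each user's profile is the set of nodes
# (users or items) that the user links to.
links = {
    "ann": {"bob", "item1", "item2"},
    "bob": {"ann", "item1", "item2", "item3"},
    "cat": {"dan", "item4"},
    "dan": {"cat", "item4", "item5"},
}

def jaccard(a, b):
    """Overlap of two link sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def predict_links(user, n=1):
    """Rank nodes the user is not yet linked to by the similarity of users who are."""
    scores = {}
    for other, nbrs in links.items():
        if other == user:
            continue
        sim = jaccard(links[user], nbrs)
        if sim == 0:
            continue
        for node in nbrs - links[user] - {user}:
            scores[node] = scores.get(node, 0.0) + sim
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda x: (-scores[x], x))[:n]

print(predict_links("ann"))  # ['item3']
```

The missing link ann -> item3 is inferred because bob, whose link set overlaps with ann's, already links to item3.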
Conclusions
- Personalization and Recommendation Technologies
  - The killer app for predictive data analytics
  - Will drive the next generation of Web applications
- Lots of new (and old) challenges
  - New: social media and social networks provide new challenges and opportunities; big data challenges the scalability and effectiveness of old algorithms
  - Old: scalability, sparsity, scrutability, serendipity
- Promising new work:
  - New approaches to hybridization
  - Social media analytics
  - Context-aware recommendation / personalization