TAAI 2012 Panel Discussion: Big Data
Chin-Yew Lin
cyl@microsoft.com
Microsoft Research Asia

About Me: Chin-Yew Lin
Senior Researcher, Knowledge Mining Group, Microsoft Research Asia
Areas of Interest
  Natural language understanding
  Knowledge mining
  Social computing
Planning
  AFNLP SIG on Semantics and Knowledge
Most recently
  Program co-chair of ACL 2012
  Program co-chair of AAAI AI & the Web 2011
Previously
  ROUGE: automatic evaluation of summaries
* Gartner Hype Cycle Big Data 2012
Decide.com (Nov 29, 2011)
  25 GB per day
  ~ Read 150K books per day
  ~500 pages per day = 800 KB of data
  100 TB = 600M books
  * http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data
Largest Facebook cluster: 100 PB (Nov 8, 2012)
  ½ PB new data per day
  60,000 Hive queries/day
  * https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
* http://www.engadget.com/2011/06/29/visualized-a-zettabyte/
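The two Decide.com book comparisons above can be checked against each other. The sketch below is only a consistency check on the slide's own numbers; it assumes decimal units (1 GB = 10**9 bytes), and the helper name `bytes_per_book` is made up for illustration.

```python
# Slide figures, assuming decimal units (1 GB = 10**9 bytes, 1 TB = 10**12 bytes).
GB = 10**9
TB = 10**12


def bytes_per_book(total_bytes: int, books: int) -> float:
    """Average size of one book implied by a (bytes, book count) pair."""
    return total_bytes / books


# "25 GB per day" compared with "150K books per day"
daily = bytes_per_book(25 * GB, 150_000)
# "100 TB = 600M books"
archive = bytes_per_book(100 * TB, 600_000_000)

print(f"implied book size (daily figure):   {daily / 1000:.1f} KB")
print(f"implied book size (archive figure): {archive / 1000:.1f} KB")
```

Both pairs imply the same average book size, roughly 167 KB, so the two comparisons on the slide use the same conversion.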
[Slide: example user profile assembled from activity on several sites]
Account: janedoe2012
Sites: Cruisecritic; Youtube; Gibson; Ozcruiseclub; Flickr; ebay; Schizophrenia
Sections: 1. Basic Information; 2. Personal life; 3. Relationship; 4. Interests
Basic Information
  Jane Doe
  Female, 52 years old, Married
  Live: New York, New York, USA
  Hometown: Boston, MA, USA
Observations
  She liked to watch the America's Got Talent show.
  She shared lots of travel experience.
  She has a second-hand guitar.
  Users celebrated her birthday online.
  Places traveled.
  She found it was difficult to get along with her son.
  Mentions of her family.
  Bought cigarettes online. A smoker?

Aggregation of Dynamic User Activity Information
Categories: Basic Information; Personal Life; Conversations; Interests
Sites: Facebook; Twitter; Flickr; ebay; Youtube; Cruisecritic; Schizophrenia; Ozcruiseclub
Jane Doe: Female, 52 years old, Married; Live: New York, New York, USA; Hometown: Boston, USA
Female, 56 years old, Married; Live: North Haven, New South Wales, Australia; Hometown: Moss Vale, New South Wales
Activities
  A second-hand guitar
  Buy cigarettes on ebay
  Family issues
  Post travel experience
  Visit the Pacific Dawn on Dec.19.2007
  Celebrate birthday online
  Like the America's Got Talent show
Aggregation of Dynamic User Activity Information
Categories: Basic Information; Personal Life; Conversations; Interests
Sites: Facebook; Twitter; Flickr; ebay; Youtube; Cruisecritic; Schizophrenia; Ozcruiseclub
Not a network of social relationships but a network of shared interests
  Jen Bailey (Jennifer Blackie)
  Female, 56 years old, Married
  Live: North Haven, New South Wales, Australia
  Hometown: Moss Vale, New South Wales
  She bought cigarettes online, so she does smoke.
  Her son takes drugs.
  She went to visit the Pacific Dawn on Dec.19.2007.
  She found it was difficult to get along with her son.
  Users celebrate her birthday online.

* http://www.engadget.com/2011/06/29/visualized-a-zettabyte/

RGB Data
  R: Right
  G: Good
  B: Big
Most Interesting Task
A.I. H.I.
NeedleSeek: Computable Knowledge
  Mining open-domain semantic knowledge from web-scale data sources
  Empower apps with computable knowledge
  Find needles in a haystack
[Figure: knowledge graph spanning the Virtual World and the Real World. Labels: Decrease; Improve; Increase; Tail; CEO; Revenue; Habitat; Cat; Animal; Head; News Channel; Founder; Company; High-Tech Company; Profit; Fortune 500 Company; Underwear; Athlete; Breed; Dog; Bark; Fox; Apple; Microsoft; Product; Boxer; Beagle; Owner; Windows; OS]
NeedleSeek: Current Status

V2.0 (May 2010)
  Data Scale: Terms: 20 million; Links: 1 billion; Categories: 10M; Head labels: 300K
  Freshness: Mar 2009 data
  Language: English
  Knowledge Types: Peer similarity; Entity cluster (semantic classes); Entity property: Hypernymy (IsA); Attributes
V2.5
  Data Scale: Terms: 80M (EN); 40M (CN); 12M (JP); Links: 2.4B (EN); 1.5B (CN); 0.6B (JP); Categories: 30M (EN); Head labels: 500K (EN)
  Freshness: Mar 2012 data
  Language: English; Chinese; Japanese; English+Chinese
  Knowledge Types: All V2.0 knowledge types; General relations; Feature vectors for entities; Entity key sentences
External Data Integration: Freebase; available structured databases

Glossary
  Term: Literal string ("iron"; "gone with the wind"; "狗"; "Lumia 800")
  Type: city; animal; book; film; person; actress
  Entity: Something that we refer to with a specific type
  Property: Peer(fox, dog); IsA(fox, animal); Attr(city, population)
  Entity Cluster with Peer Similarity: {Beijing: Beijing, Shanghai, Guangzhou}; {apple: apple, pear, orange, watermelon}

* V2.0 Demo: http://needleseek.msra.cn
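The glossary's property and cluster notation can be read as simple relation triples and keyed sets. The following is a minimal sketch of one way to hold and query them; it is not NeedleSeek's actual storage or API, and the names `Fact`, `build_index`, `lookup`, and `cluster_peers` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Fact(NamedTuple):
    """One property in the glossary's notation, e.g. IsA(fox, animal)."""
    relation: str
    subject: str
    obj: str


# The three example properties from the glossary.
FACTS = [
    Fact("Peer", "fox", "dog"),
    Fact("IsA", "fox", "animal"),
    Fact("Attr", "city", "population"),
]

# The two example entity clusters from the glossary, keyed by their label.
CLUSTERS: Dict[str, List[str]] = {
    "Beijing": ["Beijing", "Shanghai", "Guangzhou"],
    "apple": ["apple", "pear", "orange", "watermelon"],
}


def build_index(facts: List[Fact]) -> Dict[Tuple[str, str], List[str]]:
    """Index facts by (relation, subject) for direct lookup."""
    index: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for fact in facts:
        index[(fact.relation, fact.subject)].append(fact.obj)
    return index


def lookup(index: Dict[Tuple[str, str], List[str]], relation: str, subject: str) -> List[str]:
    """Return all objects recorded for relation(subject, ?)."""
    return index.get((relation, subject), [])


def cluster_peers(clusters: Dict[str, List[str]], term: str) -> List[str]:
    """Return the other members of every cluster that contains the term."""
    peers: List[str] = []
    for members in clusters.values():
        if term in members:
            peers.extend(m for m in members if m != term)
    return peers


index = build_index(FACTS)
print(lookup(index, "IsA", "fox"))
print(cluster_peers(CLUSTERS, "pear"))
```

Keying the index on (relation, subject) keeps hypernymy, peer, and attribute queries uniform; a real system at the scale quoted above would need a very different store.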
Project SOUL
Mining big social data for information discovery and recommendation
  Build a big entity database (knowledge)
    Open domain + domain-specific curated databases
  Build a big people database (profiles)
    People who act on the web
  Build a big event database (logs)
    Social interaction records on the web
    Who does what to whom, when, where, how, and why (intent)
    Ex: review, QA, comment, tag, share, like, tweet, blog
  Design algorithms leveraging these databases
  Develop services & apps enabled by these databases and algorithms

SOUL: Toward Big Social Data Analytics
  People Database
    People-centric
    People discovery; People selection; People indexing; People ranking
    Cover people and their friends
  Event Database
    Event-centric
    Event discovery; Event selection; Event indexing; Event ranking
    Cover events; link entities to people and vice versa
  Entity Database
    Entity-centric
    Entity discovery; Entity selection; Entity indexing; Entity ranking
    Cover documents and solutions
  Algorithms; Services; Applications
  PEN Graph
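The event database's "who, what, whom, when, where, how, why" framing suggests a record shape, and "link entities to people and vice versa" suggests two lookups over those records. The sketch below illustrates that idea only; the `Event` class, the helper functions, and the sample records are all hypothetical and do not come from Project SOUL.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    """One social interaction record: who does what to whom, plus context."""
    who: str                     # person acting on the web
    what: str                    # action, e.g. "review", "like", "tweet"
    whom: str                    # target entity or person
    when: Optional[str] = None
    where: Optional[str] = None
    how: Optional[str] = None
    why: Optional[str] = None    # intent, if known


def entities_for(events: List[Event], person: str) -> List[str]:
    """People -> entities: everything a person has acted on."""
    return sorted({e.whom for e in events if e.who == person})


def people_for(events: List[Event], entity: str) -> List[str]:
    """Entities -> people: everyone who has acted on an entity."""
    return sorted({e.who for e in events if e.whom == entity})


# Made-up sample log, for illustration only.
LOG = [
    Event(who="user_a", what="review", whom="camera_x", when="2012-11-01"),
    Event(who="user_b", what="like", whom="camera_x"),
    Event(who="user_a", what="tag", whom="cruise_y", where="flickr"),
]

print(entities_for(LOG, "user_a"))
print(people_for(LOG, "camera_x"))
```

Scanning a list is enough to show the two directions of the link; at web scale these would be precomputed indexes rather than per-query filters.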
* Nokia City Lens

Urban Computing With City Dynamics
Yu Zheng, Jing Yuan, Xing Xie
Data Management, Analytics, and Services Group
What's Urban Computing
Urban Computing: Sensing; Mining; Understanding; Improving
Everything in urban areas is used to sense city dynamics and to create a city-wide computing graph to tackle the challenges in serving citizens and cities.
  Route Construction from Uncertain Trajectories (KDD'12 and ICDE'12)
  Finding Smart Driving Directions (ACM SIGSPATIAL GIS'10 best paper runner-up, KDD'11)
  Discovery of Functional Regions (KDD'12)
  Passengers-Cabbie Recommender System (Ubicomp'11)
  Anomalous Events Detection (KDD'11 and ICDM'12)
  Urban Computing for Urban Planning (Ubicomp'11 best paper nominee)
Volume Velocity Variety