INFO 2950 Intro to Data Science. Lecture 17: Power Laws and Big Data




INFO 2950 Intro to Data Science
Lecture 17: Power Laws and Big Data
Paul Ginsparg, Cornell University, Ithaca, NY
29 Oct 2013

Power Laws in log-log space

    y = c x^k  (k = 1/2, 1, 2)
    log10 y = k log10 x + log10 c

[Figure: sqrt(x), x, and x^2 plotted on linear axes (0 to 100) and on log-log axes (1 to 100); on the log-log axes each power law is a straight line with slope k.]

Zipf's law

Now we have characterized the growth of the vocabulary in collections. We also want to know how many frequent vs. infrequent terms we should expect in a collection. In natural language, there are a few very frequent terms and very many very rare terms.

Zipf's law (linguist/philologist George Zipf, 1935): the i-th most frequent term has frequency proportional to 1/i:

    cf_i ∝ 1/i

where cf_i is the collection frequency: the number of occurrences of the term t_i in the collection.

http://en.wikipedia.org/wiki/Zipf%27s_law

Zipf's law: the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, which in turn occurs twice as often as the fourth most frequent word, etc.

Brown Corpus:
    "the": 7% of all word occurrences (69,971 of >1M)
    "of": 3.5% of words (36,411)
    "and": 2.9% (28,852)
Only 135 vocabulary items account for half the Brown Corpus.

The Brown University Standard Corpus of Present-Day American English is a carefully compiled selection of current American English, totaling about a million words drawn from a wide variety of sources... for many years among the most-cited resources in the field.

Zipf's law

Zipf's law: the i-th most frequent term has frequency proportional to 1/i:

    cf_i ∝ 1/i

where cf is collection frequency: the number of occurrences of the term in the collection. So if the most frequent term ("the") occurs cf_1 times, then the second most frequent term ("of") has half as many occurrences, cf_2 = (1/2) cf_1, and the third most frequent term ("and") has a third as many occurrences, cf_3 = (1/3) cf_1, etc.

Equivalently: cf_i = c i^k and log cf_i = log c + k log i (for k = -1).

An example of a power law.
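A minimal numeric sketch of the relation (the cf_1 value is the Brown Corpus count of "the" quoted earlier; the function name is ours):

```python
import math

# Zipf's law sketch: if the most frequent term occurs cf_1 times,
# the i-th most frequent term is predicted to occur cf_1 / i times.
cf_1 = 69_971  # count of "the" in the Brown Corpus

def zipf_count(i, cf_1=cf_1):
    """Predicted collection frequency of the i-th ranked term (k = -1)."""
    return cf_1 / i

# On log-log axes the relation is linear: log cf_i = log c + k log i,
# with slope k = -1.
for i in [1, 2, 3, 10, 100]:
    print(i, round(zipf_count(i)), round(math.log10(zipf_count(i)), 3))
```

Plotting rank i against zipf_count(i) on log-log axes gives a straight line of slope -1, which is exactly the diagnostic used in the Reuters and Wikipedia plots below.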

Power Laws in log-log space

    y = c x^k  (k = -1/2, -1, -2)
    log10 y = k log10 x + log10 c

[Figure: 100/sqrt(x), 100/x, and 100/x^2 plotted on linear axes (0 to 100) and on log-log axes (1 to 100); on the log-log axes each decaying power law is a straight line with negative slope k.]

Zipf's law for Reuters

[Figure: log10 cf vs. log10 rank for the Reuters collection, both axes running 0 to 7, with a straight-line fit.]

Fit far from perfect, but nonetheless the key insight holds: few frequent terms, many rare terms.

more from http://en.wikipedia.org/wiki/Zipf%27s_law

A plot of word frequency in Wikipedia (27 Nov 2006). The plot is in log-log coordinates: x is the rank of a word in the frequency table; y is the total number of the word's occurrences. The most popular words are "the", "of", and "and", as expected. Zipf's law corresponds to the upper linear portion of the curve, roughly following the green (1/x) line.

Power laws more generally

E.g., consider power law distributions of the form c/r^k, describing the number of book sales versus sales-rank r of a book, or the number of Wikipedia edits made by the r-th most frequent contributor to Wikipedia:

    Amazon book sales: c/r^k, k ≈ 0.87
    number of Wikipedia edits: c/r^k, k ≈ 1.7

(More on power laws and the long tail in Networks, Crowds, and Markets: Reasoning About a Highly Connected World by David Easley and Jon Kleinberg, Chapter 18: http://www.cs.cornell.edu/home/kleinber/networks-book/networks-book-ch18.pdf)

Wikipedia edits/month and Amazon sales/week vs. rank r

[Figure: 1258925/r^1.7 (Wikipedia edits/month) and 40916/r^0.87 (Amazon sales/week) plotted against user/book rank r, on linear axes (r from 0 to 1000) and on log-log axes (r from 1 to 10^6).]

Normalization given by the roughly 1 sale/week for the 200,000th ranked Amazon title: 40916/r^0.87; and by the 10 edits/month for the 1000th ranked Wikipedia editor: 1258925/r^1.7.

Long tail: about a quarter of Amazon book sales estimated to come from the long tail, i.e., those outside the top 100,000 bestselling titles.
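The normalization constants above can be recovered directly from the two anchor points (roughly 1 sale/week at Amazon rank 200,000, and 10 edits/month at Wikipedia rank 1,000); a small sketch, with a helper function of our own naming:

```python
# Solve c / rank**k = value_at_rank for the constant c in a power law c / r^k.
def normalization(value_at_rank, rank, k):
    return value_at_rank * rank ** k

# ~1 Amazon sale/week at rank 200,000, exponent k = 0.87
c_amazon = normalization(1, 200_000, 0.87)
# 10 Wikipedia edits/month at rank 1,000, exponent k = 1.7
c_wikipedia = normalization(10, 1_000, 1.7)

print(round(c_amazon), round(c_wikipedia))  # matches the 40916 and 1258925 on the slide
```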

Another Wikipedia count (15 May 2010)
http://imonad.com/seo/wikipedia-word-frequency-list/

All articles in the English version of Wikipedia, 21 GB in XML format (five hours to parse the entire file, extract data from the markup language, filter numbers and special characters, and extract statistics):

    Total tokens (words, no numbers): T = 1,570,455,731
    Unique tokens (words, no numbers): M = 5,800,280

Word frequency distribution follows Zipf's law

rank 1-50 (86M-3M): stop words (the, of, and, in, to, a, is, ...)
rank 51-3K (2.4M-56K): frequent words (university, January, tea, sharp, ...)
rank 3K-200K (56K-118): words from large comprehensive dictionaries (officiates, polytonality, neologism, ...)
above rank 50K: mostly Long Tail words
rank 200K-5.8M (117-1): terms from obscure niches, misspelled words, transliterated words from other languages, new words and non-words (euprosthenops, eurotrochilus, lokottaravada, ...)

Some selected words and associated counts

    Google      197920      Twitter     894
    domain      111850      domainer    22
    Wikipedia   3226237     Wiki        176827
    Obama       22941       Oprah       3885
    Moniker     4974        GoDaddy     228

Project Gutenberg (counts per billion)
http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#Project_Gutenberg

Over 36,000 items (Jun 2011), average of >50 new e-books/week
http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2006/04/1-10000

the 56271872, of 33950064, and 29944184, to 25956096, in 17420636, I 11764797, that 11073318, was 10078245, his 8799755, he 8397205, it 8058110, with 7725512, is 7557477, for 7097981, as 7037543, had 6139336, you 6048903, not 5741803, be 5662527, her 5202501, ... (list continues to the 100,000th word)
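A quick check of the top ten listed counts against the pure-Zipf prediction cf_1/i:

```python
# Top Project Gutenberg counts (per billion) from the frequency list above.
gutenberg = [56271872, 33950064, 29944184, 25956096, 17420636,
             11764797, 11073318, 10078245, 8799755, 8397205]

cf_1 = gutenberg[0]  # count for "the"
for i, actual in enumerate(gutenberg, start=1):
    predicted = cf_1 / i                 # Zipf: cf_i = cf_1 / i
    print(i, actual, round(predicted), round(actual / predicted, 2))
```

The actual/predicted ratios stay within roughly a factor of two of 1: as with the Reuters plot earlier, the fit is rough, but the 1/i trend is clearly visible.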

Bowtie structure of the web

A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, S. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer Networks, 33:309-320, 2000.

- Strongly connected component (SCC) in the center
- Lots of pages that get linked to, but don't link out (OUT)
- Lots of pages that link to other pages, but don't get linked to (IN)
- Tendrils, tubes, islands

The number of in-links (in-degree) averages 8-15, and is not randomly distributed (Poissonian); instead it follows a power law: the number of pages with in-degree i goes as 1/i^α, α ≈ 2.1.
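The quantity being measured here can be made concrete with a toy sketch (the page names and links below are hypothetical):

```python
from collections import Counter

# A tiny directed link graph: (source page, target page) pairs.
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "e"), ("d", "e"), ("a", "b")]

# In-degree of each page: how many links point to it.
in_degree = Counter(dst for _, dst in edges)

# The distribution studied by Broder et al.: how many pages have in-degree i.
# On the real web graph this histogram falls off as ~1/i^2.1.
degree_histogram = Counter(in_degree.values())

print(dict(in_degree))         # per-page in-degrees
print(dict(degree_histogram))  # {in-degree i: number of pages with that i}
```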

Poisson Distribution

Bernoulli process with N trials, each with probability p of success:

    p(m) = (N choose m) p^m (1-p)^(N-m)

gives the probability p(m) of m successes; in the limit of N very large and p small it is parametrized by just μ = Np (μ = mean number of successes).

For N ≫ m, we have N!/(N-m)! = N(N-1)···(N-m+1) ≈ N^m, so (N choose m) = N!/(m!(N-m)!) ≈ N^m/m!, and

    p(m) ≈ (1/m!) N^m (μ/N)^m (1-μ/N)^(N-m) ≈ (μ^m/m!) (1-μ/N)^N → (μ^m/m!) e^{-μ}

using lim_{N→∞} (1-μ/N)^N = e^{-μ}, and ignoring the factor (1-μ/N)^{-m} since by assumption N ≫ μm. The N dependence drops out for N → ∞ with the average μ fixed (p → 0).

The form p(m) = e^{-μ} μ^m/m! is known as a Poisson distribution. It is properly normalized: Σ_{m=0}^∞ p(m) = e^{-μ} Σ_{m=0}^∞ μ^m/m! = e^{-μ} e^μ = 1.
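The limit can be checked numerically: the exact binomial p(m) with p = μ/N approaches the Poisson form as N grows with μ fixed (a sketch using only the standard library):

```python
from math import comb, exp, factorial

def binomial_pmf(m, N, p):
    """Exact Bernoulli-process probability of m successes in N trials."""
    return comb(N, m) * p**m * (1 - p) ** (N - m)

def poisson_pmf(m, mu):
    """Limiting Poisson form e^{-mu} mu^m / m!."""
    return exp(-mu) * mu**m / factorial(m)

mu = 10
for N in [20, 100, 10_000]:
    p = mu / N  # keep the mean mu = Np fixed as N grows
    print(N, round(binomial_pmf(10, N, p), 5), round(poisson_pmf(10, mu), 5))
```

As N increases the two columns converge, illustrating that the N dependence drops out in the limit.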

Compare to power law p(m) ∝ 1/m^2.1

Poisson Distribution for μ = 10: p(m) = e^{-10} 10^m/m!

[Figure: the Poisson distribution for μ = 10 on linear axes, m from 0 to 30, peaked near m = 10 with maximum ≈ 0.125.]

Power Law p(m) ∝ 1/m^2.1 and Poisson p(m) = e^{-10} 10^m/m!

[Figure: both distributions on linear axes, m from 10 to 100; the Poisson distribution vanishes past m ≈ 25 while the power law decays slowly.]

Power Law p(m) ∝ 1/m^2.1 and Poisson p(m) = e^{-10} 10^m/m! (log-log scale)

[Figure: both distributions on log-log axes, m from 1 to 10^4; the power law is a straight line, while the Poisson distribution falls off far faster in the tail.]
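The contrast between the two tails can be made concrete numerically (a sketch; the power law is left unnormalized, as on the slides):

```python
from math import exp, factorial

# Poisson pmf for mu = 10 vs. the power law 1/m^2.1, far out in the tail.
mu = 10
for m in [10, 30, 100]:
    poisson = exp(-mu) * mu**m / factorial(m)
    power_law = 1 / m**2.1
    print(m, f"{poisson:.3g}", f"{power_law:.3g}")
```

Already at m = 30 the power law dominates, and by m = 100 the Poisson value sits tens of orders of magnitude below it: this is what makes power-law ("heavy") tails so different, as large values remain far from negligible.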

Power law distributions

Slide credit: Dragomir Radev

Examples

[Figure panels: Moby Dick; scientific papers 1981-1997; AOL users visiting sites '97; bestsellers 1895-1965; AT&T customers on 1 day; California 1910-1992.]

[Figure panels: Moon; Solar flares; wars (1816-1980); richest individuals 2003; US family names 1990; US cities 2003.]

Power law in networks!

For many interesting graphs, the distribution over node degree follows a power law:

    network                 exponent (in/out degree)
    film actors             2.3
    telephone call graph    2.1
    email networks          1.5/2.0
    sexual contacts         3.2
    WWW                     2.3/2.7
    internet                2.5
    peer-to-peer            2.1
    metabolic network       2.2
    protein interactions    2.4

Slide credit: Dragomir Radev

Next Time: More Statistical Methods

Peter Norvig, How to Write a Spelling Corrector
http://norvig.com/spell-correct.html
(See video: http://www.youtube.com/watch?v=yvdczhbjyws "The Unreasonable Effectiveness of Data", given 23 Sep 2010.)

Additional related references:
http://doi.ieeecomputersociety.org/10.1109/mis.2009.36
A. Halevy, P. Norvig, F. Pereira, The Unreasonable Effectiveness of Data, IEEE Intelligent Systems, Mar/Apr 2009 (copy at resources/unrealdata.pdf)
http://norvig.com/ngrams/ch14.pdf
P. Norvig, Natural Language Corpus Data