UNIVERSITY OF COPENHAGEN
DEPARTMENT OF COMPUTER SCIENCE
Faculty of Science

Big Data Challenges for Information Retrieval

Christina Lioma
Department of Computer Science
c.lioma@diku.dk

Slide 1/8
Information Retrieval: needles in haystacks

Branch of computer science behind search engines: find information among large, noisy, heterogeneous data

- a known needle in a known haystack
- a known needle in an unknown haystack
- an unknown needle in an unknown haystack
- any needle in a haystack
- the sharpest needle in a haystack
- most of the sharpest needles in a haystack
- all the needles in a haystack
- affirmation of no needles in the haystack
- things like needles in any haystack
- let me know whenever a new needle shows up
- where are the haystacks?
- needles, haystacks - whatever

Slide 2/8
Search engines in a nutshell

Three main types of ingredients (features):
1 Words: text semantics can be approximated by word frequencies
2 Web structure: if enough people point to something, it must be good and (probably) relevant
3 Users: search behaviour, click behaviour, dwell behaviour

INPUT: User queries: distribution over features
INPUT: Indexed documents: distribution over features
OUTPUT: Ranking: comparing distributions

Slide 3/8
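The "ranking: comparing distributions" step can be sketched with word frequencies alone. The example below is a minimal illustration, not the method of any particular search engine: it assumes add-one smoothed word distributions and uses KL divergence as the comparison, and the function names (`term_distribution`, `kl_divergence`, `rank`) and the toy documents are hypothetical.

```python
import math
from collections import Counter


def term_distribution(text, vocabulary, smoothing=1.0):
    """Add-one smoothed word-frequency distribution over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same words."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def rank(query, documents):
    """Return document ids ordered from closest to farthest from the query."""
    vocabulary = sorted(set(query.lower().split()).union(
        *(text.lower().split() for text in documents.values())))
    q_dist = term_distribution(query, vocabulary)
    scores = {
        doc_id: kl_divergence(q_dist, term_distribution(text, vocabulary))
        for doc_id, text in documents.items()
    }
    # Lower divergence means the document distribution is closer to the query.
    return sorted(scores, key=scores.get)


# Toy collection (illustrative data only).
docs = {
    "d1": "needle in a haystack",
    "d2": "search engines rank web pages",
    "d3": "haystack haystack hay",
}
print(rank("needle haystack", docs))
```

Web-structure and user features would enter the same way, as further dimensions of the distributions being compared; they are left out here to keep the sketch short.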
Anno 2013

- Realtime indexing: 20 billion pages crawled per day
- Instant search: retrieval time < 0.3 sec, faster than human typing
- Zero query search: try to retrieve information before you know what you are looking for, based on user profiling

In terms of scale:
- 50 billion indexed webpages
- 3 billion search requests per day¹ (world population: ca. 7 billion people)

Data-driven technology
Big Data challenges:
1 Long Data
2 Your Data
3 Small Data Thinking

¹ Google alone

Slide 4/8
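To put the daily figures above into per-second terms, a quick back-of-the-envelope calculation helps. The inputs are the numbers quoted on the slide; the variable names are only illustrative, and the per-second rates assume load spread evenly over the day, which real traffic is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures quoted on the slide (2013).
pages_crawled_per_day = 20e9
searches_per_day = 3e9
world_population = 7e9

# Average rates, assuming a perfectly even spread over the day.
crawl_rate = pages_crawled_per_day / SECONDS_PER_DAY
query_rate = searches_per_day / SECONDS_PER_DAY
searches_per_person = searches_per_day / world_population

print(f"pages crawled per second: {crawl_rate:,.0f}")
print(f"search requests per second: {query_rate:,.0f}")
print(f"search requests per person per day: {searches_per_person:.2f}")
```

On these numbers the crawler handles roughly 230,000 pages per second and the search front end roughly 35,000 requests per second, or a little under one search for every two people on the planet each day.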
Big data challenge 1: long data

Long as in longitudinal: spanning over time

The problem is not the range but the intervals: dynamic streams of data arriving with timestamps at sub-second granularity

Implications for search engines:
- time-versioned indexing: fine-grained updates & threaded associations
- time-travel queries: what is relevant depends on when

Slide 5/8
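One way to picture time-versioned indexing and time-travel queries is an inverted index whose postings carry a validity interval. The sketch below is an assumption-laden toy rather than a description of any production system: the class `TimeVersionedIndex`, its methods, and the integer timestamps are all hypothetical, and it ignores ranking, compression, and concurrency.

```python
from collections import defaultdict


class TimeVersionedIndex:
    """Inverted index whose postings carry a [valid_from, valid_to) interval."""

    def __init__(self):
        # term -> list of [doc_id, valid_from, valid_to]; valid_to None = still current
        self.postings = defaultdict(list)
        # doc_id -> set of terms in the current version of the document
        self.current_terms = {}

    def update(self, doc_id, text, timestamp):
        """Index a new version of doc_id observed at the given timestamp."""
        new_terms = set(text.lower().split())
        old_terms = self.current_terms.get(doc_id, set())
        # Close the open posting for every term that disappeared.
        for term in old_terms - new_terms:
            for posting in self.postings[term]:
                if posting[0] == doc_id and posting[2] is None:
                    posting[2] = timestamp
        # Open a posting for every term that is new in this version.
        for term in new_terms - old_terms:
            self.postings[term].append([doc_id, timestamp, None])
        self.current_terms[doc_id] = new_terms

    def search(self, term, as_of):
        """Time-travel query: ids of documents that contained term at time as_of."""
        return sorted(
            doc_id
            for doc_id, start, end in self.postings.get(term.lower(), [])
            if start <= as_of and (end is None or as_of < end)
        )


index = TimeVersionedIndex()
index.update("page", "election polls open", timestamp=10)
index.update("page", "election results announced", timestamp=20)

print(index.search("polls", as_of=15))    # the page as it looked at time 15
print(index.search("polls", as_of=25))    # the term has since been removed
print(index.search("results", as_of=25))
```

Only the terms that changed between versions touch the index, which is the "fine-grained updates" part; the same query term returns different results depending on `as_of`, which is the "what is relevant depends on when" part.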
Big data challenge 2: your data

Personalisation. Can of worms.

We can collect your data BUT it is safer not to personalise than to annoy you...

Big data implications:
- Personalised data on two axes: individual (e.g. user click-through, preferences, history) and social (e.g. Twitter, Facebook, blogs)
- Search engines must translate all this data into a single user state reflecting user preferences
- This state needs to be updated dynamically with every new input, but also remain consistent and below the nuisance threshold
- The larger and noisier the input, the harder to keep this balance

Slide 6/8
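The tension between updating the user state with every input and keeping it consistent can be illustrated with an exponential moving average. This is only one possible reading of the slide: the functions `update_user_state` and `should_personalise`, the `learning_rate`, and the `confidence_threshold` (used here as a stand-in for staying below the nuisance threshold) are all hypothetical.

```python
def update_user_state(state, signal, learning_rate=0.1):
    """Blend a new input into the user state with an exponential moving average.

    state and signal map topic -> weight in [0, 1]. A small learning_rate keeps
    the state consistent: one noisy input can only shift it slightly.
    """
    topics = set(state) | set(signal)
    return {
        topic: (1 - learning_rate) * state.get(topic, 0.0)
        + learning_rate * signal.get(topic, 0.0)
        for topic in topics
    }


def should_personalise(state, topic, confidence_threshold=0.5):
    """Personalise only when the accumulated evidence for a topic is strong."""
    return state.get(topic, 0.0) >= confidence_threshold


state = {}
# Ten consistent clicks on the same topic (illustrative data only).
for _ in range(10):
    state = update_user_state(state, {"cycling": 1.0})
# One stray click on an unrelated topic.
state = update_user_state(state, {"knitting": 1.0})

print(should_personalise(state, "cycling"))
print(should_personalise(state, "knitting"))
```

With these settings the repeated interest crosses the threshold while the single stray click does not. Raising `learning_rate` makes the state react faster but also makes it noisier, which is the balance the slide describes.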
Big data challenge 3: small data thinking

R&D in information retrieval: clear division between efficiency and effectiveness

Efficiency: index compression, reducing lookup time, query caching...
    Is not always on-topic

Effectiveness: accurate feature extraction, personalisation, relevance...
    Does not always scale

Slide 7/8
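As a concrete instance of the efficiency side, index compression is often illustrated with variable-byte coding of the gaps between sorted document ids. The sketch below shows that general technique under simple assumptions (sorted, non-negative integer ids; 7 payload bits per byte with the high bit marking "more bytes follow"); the function names are illustrative and nothing here is taken from the slides beyond the term "index compression".

```python
def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, least significant first.

    The high bit of a byte is set when more bytes of the same number follow.
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation bit set: keep accumulating
        else:
            numbers.append(n)   # last byte of this number
            n, shift = 0, 0
    return numbers


def compress_postings(doc_ids):
    """Store a sorted posting list as variable-byte encoded gaps."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the sorted document ids from the encoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [824, 829, 215406]
compressed = compress_postings(postings)
print(len(compressed), "bytes instead of", 4 * len(postings))
print(decompress_postings(compressed))
```

The example also hints at the slide's point: the code makes lookups cheaper to store and transfer but says nothing about whether the documents it returns are relevant.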
Sources

- Haystack image, page 2: http://footprinthr.com.au/wp-content/uploads/2012/01/needle_haystack.jpg
- Needles in haystack metaphor, page 2: Matthew Koll, Bulletin of the American Society for Information Science, Vol. 2, No. 2, December/January 2000
- Typewriter image, page 3: Copyright: Roberto Zilli, ID: 99118544, available from http://www.shutterstock.com
- Distributions image, page 3: Source: Edgar Meij, Large-scale Data Processing for Information Retrieval, 2012
- Tweets image, page 5: Source: http://blog.crowdbooster.com/take-control-of-your-twitter-data-introducing
- Can of worms image, page 6: Copyright: munchester2cool, available from http://munchester2cool.deviantart.com/art/luke-s-can-of-worms-55442402
- Efficiency vs. effectiveness image, page 7: http://psychologyface.com/2012/11/effectiveness-and-efficiency

Slide 8/8