Big Data Challenges for Information Retrieval




UNIVERSITY OF COPENHAGEN
DEPARTMENT OF COMPUTER SCIENCE
Faculty of Science

Big Data Challenges for Information Retrieval
Christina Lioma
Department of Computer Science
c.lioma@diku.dk

Slide 1/8

Information Retrieval: needles in haystacks

Branch of computer science behind search engines: find information among large, noisy, heterogeneous data.

  a known needle in a known haystack
  a known needle in an unknown haystack
  an unknown needle in an unknown haystack
  any needle in a haystack
  the sharpest needle in a haystack
  most of the sharpest needles in a haystack
  all the needles in a haystack
  affirmation of no needles in the haystack
  things like needles in any haystack
  let me know whenever a new needle shows up
  where are the haystacks?
  needles, haystacks - whatever

Slide 2/8

Search engines in a nutshell

Three main types of ingredients (features):
  1. Words: text semantics can be approximated by word frequencies
  2. Web structure: if enough people point to something, it must be good and (probably) relevant
  3. Users: search behaviour, click behaviour, dwell behaviour

INPUT: user queries (distribution over features)
INPUT: indexed documents (distribution over features)
OUTPUT: ranking (comparing distributions)

Slide 3/8
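The "comparing distributions" view of ranking can be illustrated with a minimal sketch. It uses word frequencies only (the first feature type), add-one smoothing, and KL divergence as the comparison. The slides name none of these choices, so the smoothing, the divergence measure, and the function names below are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_distribution(text, vocabulary, smoothing=1.0):
    """Return a smoothed word-frequency distribution over `vocabulary`."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}


def kl_divergence(p, q):
    """KL(p || q); both distributions must share the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def rank(query, documents):
    """Order document ids by how close their distribution is to the query's.

    Assumption: closeness is measured by KL divergence (smaller is better).
    """
    texts = [query, *documents.values()]
    vocabulary = sorted({w for text in texts for w in text.lower().split()})
    query_dist = word_distribution(query, vocabulary)
    scored = [
        (kl_divergence(query_dist, word_distribution(text, vocabulary)), doc_id)
        for doc_id, text in documents.items()
    ]
    return [doc_id for _, doc_id in sorted(scored)]


documents = {
    "d1": "needle in a haystack needle",
    "d2": "cooking pasta at home",
    "d3": "a haystack on a farm",
}
ranking = rank("needle haystack", documents)
```

In a real engine the distributions would also cover link and user-behaviour features; this sketch only shows the comparison step.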

Anno 2013

Realtime indexing: 20 billion pages crawled per day
Instant search: retrieval time < 0.3 sec, faster than human typing
Zero query search: try to retrieve information before you know what you are looking for, based on user profiling

In terms of scale:
  50 billion indexed webpages
  3 billion search requests per day [1] (world population: ca. 7 billion people)

Data-driven technology
Big Data challenges:
  1. Long Data
  2. Your Data
  3. Small Data Thinking

[1] Google alone

Slide 4/8
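To put the scale figures in perspective, here is a back-of-the-envelope calculation using only the numbers quoted on the slide. The derived quantities are plain arithmetic; reading the crawl rate as "time to re-crawl an index of this size" is an assumption of this sketch, not a claim from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slide (2013).
searches_per_day = 3e9
world_population = 7e9
indexed_pages = 50e9
pages_crawled_per_day = 20e9

# Average query load per second, ignoring daily peaks.
searches_per_second = searches_per_day / SECONDS_PER_DAY

# Average searches per person on Earth per day.
searches_per_person = searches_per_day / world_population

# Assumption: if every crawled page were a distinct indexed page,
# this is how many days a full pass over the index would take.
days_per_full_crawl = indexed_pages / pages_crawled_per_day

print(f"{searches_per_second:,.0f} searches per second")
print(f"{searches_per_person:.2f} searches per person per day")
print(f"{days_per_full_crawl:.1f} days per full crawl")
```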

Big data challenge 1: long data

Long as in longitudinal: spanning over time.
The problem is not the range but the intervals: dynamic streams of data come in with timestamps at intervals shorter than seconds.

Implications for search engines:
  time-versioned indexing: fine-grained updates & threaded associations
  time-travel queries: what is relevant depends on when

Slide 5/8
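A minimal sketch of what time-versioned indexing and a time-travel query might look like. It assumes updates arrive in timestamp order, as in a stream, and that each posting records when a term appeared in or disappeared from a document. The class and method names are hypothetical, and this is an illustration rather than how any particular engine does it.

```python
from collections import defaultdict


class TimeVersionedIndex:
    """Inverted index whose postings record when a term entered or left a document."""

    def __init__(self):
        # term -> list of (timestamp, doc_id, present), in arrival order
        self._postings = defaultdict(list)
        # doc_id -> set of terms in the latest version of the document
        self._current = {}

    def update(self, timestamp, doc_id, text):
        """Index a new version of a document (assumes increasing timestamps)."""
        new_terms = set(text.lower().split())
        old_terms = self._current.get(doc_id, set())
        for term in new_terms - old_terms:
            self._postings[term].append((timestamp, doc_id, True))
        for term in old_terms - new_terms:
            self._postings[term].append((timestamp, doc_id, False))
        self._current[doc_id] = new_terms

    def search(self, term, as_of):
        """Time-travel query: documents that contained `term` at time `as_of`."""
        matching = set()
        for timestamp, doc_id, present in self._postings.get(term.lower(), []):
            if timestamp > as_of:
                break
            if present:
                matching.add(doc_id)
            else:
                matching.discard(doc_id)
        return matching


index = TimeVersionedIndex()
index.update(1.0, "tweet1", "storm warning")
index.update(2.5, "tweet1", "storm over")
index.update(3.0, "tweet2", "warning lifted")
```

Only the changed terms are written on each update, which is one way to read "fine-grained updates"; the same query returns different results depending on the `as_of` time.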

Big data challenge 2: your data

Personalisation. Can of worms.
We can collect your data BUT it is safer not to personalise than to annoy you...

Big data implications:
  Personalised data on two axes: individual (e.g. user click-through, preferences, history) and social (e.g. Twitter, Facebook, blogs)
  Search engines must translate all this data into a single user state reflecting user preferences
  This state needs to be updated dynamically with every new input, but also remain consistent and below the nuisance threshold
  The larger and noisier the input, the harder it is to keep this balance

Slide 6/8
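One way to picture a user state that is updated with every new input yet stays consistent is an exponential moving average whose per-update change is capped. The cap stands in for the "nuisance threshold" here. That reading, the parameter values, and the function name are all assumptions of this sketch, not something the slides specify.

```python
def update_user_state(state, signal, learning_rate=0.1, max_shift=0.05):
    """Blend a new input into the user state without abrupt changes.

    state, signal: dicts mapping a topic to a preference weight in [0, 1].
    learning_rate: how strongly a single input pulls the state (assumed value).
    max_shift: cap on the change per topic per update (assumed stand-in
        for the nuisance threshold).
    """
    new_state = {}
    for topic in set(state) | set(signal):
        old = state.get(topic, 0.0)
        target = signal.get(topic, 0.0)
        shift = learning_rate * (target - old)
        # Clamp the change so one noisy input cannot swing the profile.
        shift = max(-max_shift, min(max_shift, shift))
        new_state[topic] = old + shift
    return new_state


state = {"sports": 0.5}
click = {"sports": 1.0, "cooking": 1.0}  # e.g. one click-through event
state_after = update_user_state(state, click)
```

A larger `learning_rate` adapts faster but makes the state noisier, which is the balance the slide describes.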

Big data challenge 3: small data thinking

R&D in information retrieval: clear division between efficiency and effectiveness

Efficiency: index compression, reducing lookup time, query caching...
  Is not always on-topic
Effectiveness: accurate feature extraction, personalisation, relevance...
  Does not always scale

Slide 7/8

Sources

Haystack image, page 2: http://footprinthr.com.au/wp-content/uploads/2012/01/needle_haystack.jpg
Needles in haystack metaphor, page 2: Matthew Koll, Bulletin of the American Society for Information Science, Vol. 2, No. 2, December/January 2000
Typewriter image, page 3: Copyright: Roberto Zilli, ID: 99118544, available from http://www.shutterstock.com
Distributions image, page 3: Source: Edgar Meij, Large-scale Data Processing for Information Retrieval, 2012
Tweets image, page 5: Source: http://blog.crowdbooster.com/take-control-of-your-twitter-data-introducing
Can of worms image, page 6: Copyright: munchester2cool, available from http://munchester2cool.deviantart.com/art/luke-s-can-of-worms-55442402
Efficiency vs. effectiveness image, page 7: http://psychologyface.com/2012/11/effectiveness-and-efficiency

Slide 8/8