Big Data and Scripting




Big Data and Scripting - abstract/organization
- contents: introduction to Big Data and involved techniques
- schedule
  - 2 lectures (Mon 1:30 pm in M628 and Thu 10 am in F420)
  - 2 tutorials (Fri 10:00 am and 1:30 pm in F420); attend one tutorial
  - written exams (July 30, October 14)
- lecture: Uwe Nagel (uwe.nagel@uni-konstanz.de)
- tutorials: Mark Ortmann

Big Data and Scripting - organizational stuff
- communication
  - website: http://www.inf.uni-konstanz.de/algo/lehre/ss14/bds/
    - assignment sheets
    - slides of the lecture
    - announcements
  - register for this lecture in the LSF! (lsf.uni-konstanz.de)
- assignments
  - mostly implementation tasks
  - involve languages/algorithms covered in the lecture
  - discussion and help in tutorials
  - requirement for exams: 50% of assignment points

agenda - contents of this lecture
- tools and techniques for (distributed) computation
  - Unix command line (i.e. bash scripting)
  - Python (machine learning)
  - NoSQL by example (databases)
  - map/reduce (distributed computing)
- algorithms for (distributed) computation
  - streaming algorithms
  - memory hierarchies and distributed storage
  - distributed and parallel algorithms

What this lecture does not cover
- data mining
  - we use some data mining algorithms, but this is not a data mining course
  - lecture: Data Mining: Artificial Intelligence
- recommender systems
  - mentioned, but without detailed coverage
  - again, not a primary topic of this lecture

Prologue
- what does Big Data mean and why is it interesting?
- questions:
  - Is Big Data just another buzzword?
  - Is there an actual meaning or an advantageous technique behind it?
  - If so, where does it come from and what exactly is it?
- remainder of this lecture:
  - 3 example applications
  - increasing level of detail

Example 1: Amazon
- a simple example: Amazon
  - basically a selling platform
  - provides:
    - connection of suppliers to customers
    - a common marketplace (one interface for all shops)
    - additional services (storage, shipment, payment, search)
    - recommendation
- what is the difference to competitors?
  - Amazon knows customers, products, sales and views
  - the same is true for its competitors

Example 1: Amazon
- in comparison, Amazon has many more customers
  - more customers, more transactions, more views
  - a larger data collection
  - better recommendations
- estimate [1]: 1/3 of Amazon's sales are generated by recommendations
- more data = better predictions?
  - simple answer: essentially yes
  - real answer: it's a bit more complicated

[1] www.economist.com/blogs/graphicdetail/2013/02/elusive-big-data

Intermission
- what are we trying to find out?
  - learning/data mining and artificial intelligence are not that new
  - somehow, huge amounts of data can make a difference
  - question: how and why?
- approach: analyze examples that use big data
  1. where is the big data?
  2. what kind of data is involved?
  3. what makes a large database crucial?

Example 2: Target and the pregnant teen
- Target
  - a large discount retail chain (similar to Walmart/Aldi)
  - uses data analysis for marketing
- the story [2]
  - Target advertises baby equipment to a teenager
  - her father complains, then finds out that his daughter is actually pregnant
  - the story suggests that Target predicts pregnancy better than family members

[2] www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-

Example 2: Target and the pregnant teen - How?
- in a nutshell
  - collect data about customers (who buys what, when)
  - predict what they are interested in
  - adjust advertisement to the specific person

Example 2: Target and the pregnant teen - How?
- 1. step: data collection
  - create a large base of data available about customers
  - each customer gets some unique ID (credit card, email, ...)
  - everything that can be connected to the customer is
    - collected
    - connected to the customer ID
    - used for interest prediction
- examples of data to collect
  - items purchased together
  - time/place of purchase
  - weather?
  - whatever can be collected

Example 2: Target and the pregnant teen - How?
- next: search for patterns
  - simple: people buy what they always bought
  - recommendation: customers who bought this usually also buy ...
- concrete targeting, example: young parents
  - a new child is a perfect opportunity: parents have to buy a lot of stuff (without having too much money)
  - at this stage they are more likely to become bound to brands
  - predicting pregnancy is therefore crucial for advertisement
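The "customers who bought this usually also buy ..." pattern can be sketched as simple co-purchase counting. This is only a minimal illustration, not the method Target or Amazon actually uses; the function names and the toy baskets are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how often two items appear together in the same basket."""
    counts = defaultdict(Counter)
    for basket in baskets:
        # set() removes duplicates so each pair counts once per basket
        for a, b in combinations(sorted(set(basket)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, item, k=3):
    """Return up to k items most often bought together with `item`."""
    return [other for other, _ in counts[item].most_common(k)]


# Hypothetical toy baskets, invented for illustration only.
baskets = [
    ["diapers", "baby lotion", "beer"],
    ["diapers", "baby lotion"],
    ["beer", "chips"],
    ["diapers", "baby lotion", "chips"],
]
counts = co_purchase_counts(baskets)
print(recommend(counts, "diapers", k=1))  # prints ['baby lotion']
```

With only four baskets the counts are tiny; the point of the slides is that such counts become reliable only with very many recorded purchases.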

Example 2: Target and the pregnant teen - How?
- data gathering
  - customers are described by their purchases: items, time, payment method, ...
  - products can be described by purchases: family products are mostly bought on weekends
- given enough records, patterns emerge
  - typical purchase histories (as in the pregnancy example)
  - typical customers (as in "I always buy beer and chips")
  - new products that become popular vs. products that are ignored
- these patterns can be very complicated
- more data leads to more opportunities (e.g. more complicated ...)

Example 2: Target and the pregnant teen - results
- patterns in the example
  - quoting a Target analyst: they identified 25 products
  - analyzed together, these allow a pregnancy prediction score
  - example: pregnant women buy supplements like calcium, magnesium and zinc sometime in the first 20 weeks
- business impact
  - start of program: 2002
  - revenue growth: from $44 billion (2002) to $67 billion (2010)
  - it is assumed that data mining was crucial for this growth
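How a set of indicator products might be combined into one score can be sketched as a weighted sum. This is an assumption for illustration: the slides only state that 25 products were identified, not how they were weighted or combined. The product names beyond the supplements mentioned above, the weights, and the function name are all hypothetical.

```python
# Hypothetical indicator weights, invented for illustration only.
# The source names the three supplements but gives no weights.
WEIGHTS = {
    "calcium supplement": 1,
    "magnesium supplement": 1,
    "zinc supplement": 1,
}


def prediction_score(purchases, weights=WEIGHTS):
    """Sum the weights of all indicator products in a purchase history."""
    # set() makes sure repeated purchases of one product count only once
    return sum(weights.get(item, 0) for item in set(purchases))


history = ["calcium supplement", "zinc supplement", "beer", "zinc supplement"]
print(prediction_score(history))  # prints 2
```

A customer would then be targeted once the score passes some threshold; choosing weights and threshold is exactly the part that needs a large purchase database.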

Example 3: machine translation - the task
- automatic translation of text
  - given: text $T$ in language $A$
  - result: text $T'$ in language $B$
- example: Google's translator
  - URL: http://translate.google.de/

Example 3: machine translation - a naive approach
- word mappings
  - hold a dictionary $W: A \to B$
  - replace each $w \in T$ by $W(w)$
- 1. problem: words don't match exactly between languages, neither in meaning nor in number
- 2. problem: grammar; grammar is hard
  - 1. problem: language is noisy, grammar is often not exact
  - 2. problem: even exact grammar is hard (semantics, context)
  - Chomsky hierarchy, theory of computer science
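The naive word-mapping approach can be sketched in a few lines. The dictionary below is a hypothetical toy example, and the function name is an assumption; the output shows both problems at once (word order and word count do not carry over).

```python
# Hypothetical toy dictionary W (French -> English), invented for illustration.
W = {"je": "I", "ne": "not", "vous": "you", "connais": "know", "pas": "not"}


def translate_word_by_word(text, dictionary):
    """Replace every word w of the text by dictionary[w]; keep unknown words."""
    words = text.lower().rstrip(".").split()
    return " ".join(dictionary.get(w, w) for w in words)


print(translate_word_by_word("Je ne vous connais pas.", W))
# prints: I not you know not
```

The result is not an English sentence: "ne ... pas" should become a single "don't", and "you" should move to the end.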

Example 3: machine translation - a statistical approach
- new approach: translation by example
  - model with very few assumptions and simple rules
  - rules are expressed by probabilities
  - examples are taken from a corpus of manually translated documents
- basic idea
  - model training: learn the probability $P$ that some text $T'$ is a translation of text $T$
  - translation: find the $T'$ with maximal $P$
- note: the following explains the principle and is not correct in every detail

Example 3: machine translation - breaking down probabilities
- example: translate French text $F$ to English text $E$
  - $P(E \mid F)$: probability that $E$ is a correct translation of $F$
  - let $F = f_1 f_2 \dots$ ($f_i$ a sentence; $E$ analogous)
- first splitting assumption:
  - $f_i$ corresponds to $e_i$
  - $E$ is correct if each $e_i$ translates $f_i$, independent of the other sentences
  - $P(E \mid F) = \prod_i P(e_i \mid f_i)$
- try to maximize $P(e_i \mid f_i)$ for all $i$ independently: find the most probable translation of each sentence
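Under the independence assumption, maximizing the product over sentences reduces to picking the best candidate for each sentence separately. The sketch below illustrates this; the candidate sentences, their probabilities, and the function names are hypothetical. Logarithms are summed instead of multiplying probabilities, a common way to avoid numerical underflow for long texts.

```python
import math


def best_translation(candidates):
    """Pick the candidate e_i with the highest P(e_i | f_i)."""
    return max(candidates, key=candidates.get)


def translate_text(per_sentence_candidates):
    """Maximize prod_i P(e_i | f_i) by maximizing every factor on its own."""
    chosen = [best_translation(c) for c in per_sentence_candidates]
    log_p = sum(math.log(c[e]) for c, e in zip(per_sentence_candidates, chosen))
    return chosen, math.exp(log_p)


# Hypothetical candidate translations with invented probabilities,
# one dictionary per source sentence f_i.
per_sentence_candidates = [
    {"I don't know you.": 0.5, "I not you know.": 0.25},
    {"Good morning.": 0.5, "Good day.": 0.25},
]
chosen, p = translate_text(per_sentence_candidates)
print(chosen)  # prints ["I don't know you.", 'Good morning.']
```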

Example 3: machine translation - breaking down probabilities
- consider a concrete pair of sentences:
  - Je ne vous connais pas.
  - I don't know you.
- word correspondences
  - Je - I
  - vous - you
  - connais - know
  - ne ... pas - don't
- some observations
  - words are translated (Je → I)
  - some words change place (vous → you)
  - some words change number (ne ... pas → don't)

Example 3: machine translation - breaking down probabilities
- Je ne vous connais pas. / I don't know you.
- formalize our observations into concrete probabilities:
  - translation $P(f \mid e)$: $f$ is the translation of $e$ (Je → I)
  - distortion $P(t \mid s, l)$: the word at position $t$ is replaced by the word at position $s$ in a sentence of length $l$ (vous → you)
  - fertility $P(n \mid e)$: $e$ is replaced by $n$ French words (ne ... pas → don't)

Example 3: machine translation - breaking down probabilities
- how does this help for $P(E \mid F)$?
  - recall the assumption $P(E \mid F) = \prod_i P(e_i \mid f_i)$
  - $P(E \mid F)$ is high if every $P(e_i \mid f_i)$ is high
  - the same principle can be applied on the sentence level
- breaking up sentences
  - $P(f_i, e_i)$ has many parts
    - translation, distortion, fertility for every word
    - some more, unknown
  - combination by product (assuming independence)
  - $P(f_i, e_i)$ is close to 1 if all the parts are close to 1
  - use translation, distortion, fertility as indicators
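The "combination by product" can be sketched as follows. The per-word probabilities are invented and the function name is an assumption; the sketch only shows that the product stays high when all parts are high and drops sharply as soon as one part is low.

```python
import math


def sentence_pair_score(word_parts):
    """Multiply translation, distortion and fertility parts of all words.

    word_parts is a list of (translation, distortion, fertility) triples,
    one per word; independence of all parts is assumed.
    """
    return math.prod(p for parts in word_parts for p in parts)


# Hypothetical probabilities, invented for illustration only.
good_pair = [(0.9, 0.9, 0.9), (0.9, 0.9, 0.9)]
bad_pair = [(0.9, 0.9, 0.9), (0.9, 0.1, 0.9)]  # one implausible distortion

print(sentence_pair_score(good_pair) > sentence_pair_score(bad_pair))  # prints True
```

`math.prod` requires Python 3.8 or later; for long sentences a sum of logarithms would be the numerically safer choice.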

Example 3: machine translation - wrap up
- assumption: translation can be broken down into simple probabilities
- learning: estimate the individual probabilities
- translation: find the most probable sentences
- why is this a Big Data application?
  - individual probabilities are learned from real, manually translated texts
  - this is an individual problem for each pair of languages
  - many individual probabilities have to be determined
  - they can be estimated from example texts (more texts, better estimates)
  - quality grows with additional knowledge (texts) fed into the system

Example 3: machine translation - missing data/open questions
- data sources
  - many books are translated into various languages
  - European laws are translated into all European languages
  - derive approximate probabilities by counting in the corpus
- open
  - matching of sentences and words
  - finding the actual translation (we only have partial probabilities)
- further information: http://www.mt-archive.info/
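"Counting in the corpus" can be sketched as a relative-frequency estimate of the translation probability $P(f \mid e)$. The sketch assumes that word-level matches are already given, which the slide lists as an open problem; the aligned pairs and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_translation_probs(aligned_pairs):
    """Estimate P(f | e) as count(e matched with f) / count(e)."""
    pair_counts = defaultdict(Counter)
    for e, f in aligned_pairs:
        pair_counts[e][f] += 1
    probs = {}
    for e, f_counts in pair_counts.items():
        total = sum(f_counts.values())
        probs[e] = {f: n / total for f, n in f_counts.items()}
    return probs


# Hypothetical (English, French) word matches, invented for illustration.
aligned_pairs = [
    ("I", "je"), ("I", "je"), ("I", "moi"),
    ("you", "vous"), ("you", "tu"),
    ("know", "connais"),
]
probs = estimate_translation_probs(aligned_pairs)
print(probs["you"]["vous"])  # prints 0.5
```

With more translated text, each count rests on more observations, which is the sense in which more texts give better estimates.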

discussion
- why does this work?
  - it works only partially
  - test for yourself: translate.google.com
  - still, the quality of the results is surprising
- does it scale?
  - why is it not always correct?
  - what would be the impact of adding more data?
  - can it be parallelized?

the notion Big Data
- no agreement on a definition
- usual understanding: methods that
  - involve machine learning/data mining
  - necessarily involve massive amounts of data
  - improve with additional input

the remainder of this lecture
- some applications necessarily involve massive amounts of data
- this lecture is about handling such data
  - we are (almost) not interested in the actual application
  - we are interested in the means to enable these applications
- two main points of interest
  - practical: scripting, command lines, programming
  - theoretical: algorithms for data storage and handling, basic algorithms for large data sets (e.g. sorting)
- next lecture: basics on the command line