Big Data and Scripting (lecture, computer science, bachelor/master/PhD)
Big Data and Scripting - abstract/organization
- abstract: introduction to Big Data and the techniques involved
- lecture (2+2) with practical exercises to be turned in
- dates:
  - 2 lectures (Mon 1:30 pm, M628 and Thu 10 am, G302)
  - 2 lab courses (Fri 10:00 am and 1:30 pm in Z613)
- oral exam at the end of the semester
- lecturer: Uwe Nagel, uwe.nagel@uni-konstanz.de
Big Data and Scripting - organizational matters
- exercises website: http://www.inf.uni-konstanz.de/algo/lehre/ss13/bds/
- (about) 3 projects (bash, R, NoSQL/Hadoop)
- programming skills useful, but not required
- discussion and help in the lab course (Friday)
agenda - contents of this lecture
- prologue: What is Big Data and why bother?
  - concrete examples
  - identify qualitatively what sets Big Data approaches apart
- tools and techniques for (distributed) computation
  - (some) basic notions of data handling
  - Unix command line
  - scripting in R
  - NoSQL by example
  - the map/reduce paradigm (example: Hadoop)
What this lecture does not cover
- basics of data mining
  - we use some data-mining techniques, but this is not a data mining course
  - see the lecture Data Mining: Artificial Intelligence
- recommender systems
  - we will touch on those, but without detail
  - see the seminar/lecture Recommender Systems
Prologue
- what does Big Data mean, and why is it interesting?
- Big Data and distributed computing seem like a fashion
  - is there really an advantage?
  - where does this advantage come from?
- 3 example applications, with increasing level of detail
What is Big Data and why bother? - a simple example: Amazon
- basically a selling platform; provides:
  - connection of suppliers to (private) customers
  - a common marketplace (one interface for all)
  - additional services (storage, shipment, payment)
  - recommendations
- what is the difference to competitors?
  - Amazon knows customers, products, sales and views
  - but the same is true for its competitors
What is Big Data and why bother?
- in comparison, Amazon has many more customers
  - more customers, more transactions, more views
  - a larger data collection
  - better recommendations
- estimate: about 1/3 of Amazon's sales are generated by recommendations [1]
- more data = better predictions?
  - simple answer: essentially yes
  - real answer: it's a bit more complicated
[1] www.economist.com/blogs/graphicdetail/2013/02/elusive-big-data
What is Big Data? - extraction from examples
- what are we trying to find out?
  - learning/data mining and artificial intelligence are not that new
  - yet somehow huge amounts of data can make a difference
  - question: how and why?
- approach: analyze examples using big data
  1. where is the big data?
  2. what kind of data is involved?
  3. what makes a large database crucial?
Target and the pregnant teen
- Target
  - a large discounter chain (similar to Walmart)
  - uses data analysis for targeted marketing
  - central to one of the most famous big data stories
- the story: Target predicts pregnancy better than family members
- source: www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-fathe
Target and the pregnant teen - How?
- in a nutshell:
  1. collect data about customers
  2. predict what they are interested in
  3. adjust advertisement to the specific person
Target and the pregnant teen - How?
- step 1: data collection
  - create a large base of data available about customers
  - each customer gets a unique ID (credit card, email, ...)
  - everything that can be connected to the customer is collected
    - connected to the customer ID
    - used for interest prediction
- examples of data to collect (a toy record layout is sketched below):
  - items purchased together
  - time/place of purchase
  - weather? - whatever can be collected
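A minimal sketch of what such a purchase log could look like in R (one of the course's scripting languages); the field names and values are invented for illustration and are not Target's actual schema:

```r
# Hypothetical purchase log: every event is tied to a customer ID.
# Fields and values are illustrative only.
purchases <- data.frame(
  customer_id = c("C1001", "C1001", "C1001", "C2002", "C2002"),
  item        = c("prenatal vitamins", "unscented lotion", "cotton balls",
                  "unscented lotion", "beer"),
  timestamp   = as.POSIXct(c("2012-02-01 10:15:00", "2012-02-03 18:40:00",
                             "2012-02-03 18:40:00", "2012-02-05 12:00:00",
                             "2012-02-05 12:00:00")),
  store       = c("S17", "S17", "S17", "S03", "S03"),
  stringsAsFactors = FALSE
)

# everything known about one customer, linked via the ID
subset(purchases, customer_id == "C1001")
```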
Target and the pregnant teen - How?
- next: search for patterns
  - simple: people buy what they always bought
  - recommendation: customers who bought this usually also buy ... (a counting sketch follows below)
- concrete targeting, example: young parents
  - a new child is a perfect opportunity:
    - parents have to buy a lot of stuff (without having too much money)
    - at this stage they are more likely to become bound to brands
  - prediction of pregnancy is therefore crucial for advertisement
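A toy version of "customers who bought this usually also buy ..." as pure counting; the baskets below are invented:

```r
# Hypothetical per-customer purchase histories.
baskets <- list(
  C1001 = c("prenatal vitamins", "unscented lotion", "cotton balls"),
  C2002 = c("unscented lotion", "beer")
)

# Count how often two items occur in the same customer's history.
pairs <- lapply(baskets, function(items) {
  items <- sort(unique(items))
  if (length(items) < 2) return(NULL)
  combn(items, 2, paste, collapse = " & ")  # all item pairs per customer
})
co_buy <- table(unlist(pairs))
sort(co_buy, decreasing = TRUE)  # most frequent pairs drive recommendations
```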
Target and the pregnant teen - How?
- remark: this is how one could do it, not necessarily how it was done
- ground truth?
  - customers are described by their purchases
  - goal: identify patterns typical for pregnant women
- first steps: identify purchase records
  - of pregnant women (i.e. positive label, group P)
  - of non-pregnant customers (i.e. negative label, group N)
- searching for hints
  - find commonalities within P
  - find features distinguishing P from N
  - build a predictor for P(c ∈ P) (it is unknown how exactly Target does this; one possible approach is sketched below)
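Since Target's actual method is unknown, here is one standard way such a predictor could be built: logistic regression on purchase-based features. The features and labels below are invented for illustration.

```r
# Toy training set: rows are customers, columns are purchase features,
# label 'pregnant' encodes group P (1) vs. group N (0). All values invented.
train <- data.frame(
  bought_supplements = c(1, 1, 0, 1, 1, 0),
  bought_unscented   = c(1, 0, 0, 0, 1, 1),
  pregnant           = c(1, 1, 0, 0, 1, 0)
)

# Logistic regression: models P(c in P) from the features.
model <- glm(pregnant ~ bought_supplements + bought_unscented,
             data = train, family = binomial)

# Score a new customer: the estimated P(c in P).
predict(model,
        newdata = data.frame(bought_supplements = 1, bought_unscented = 1),
        type = "response")
```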
Target and the pregnant teen - results
- identified patterns, quoting a Target analyst:
  - they identified 25 products that, when analyzed together, allow a pregnancy prediction score P(c ∈ P)
  - example: pregnant women buy supplements like calcium, magnesium and zinc sometime in the first 20 weeks
- business impact
  - start of program: 2002
  - revenue growth: $44 billion (2002) → $67 billion (2010)
  - it is assumed that data mining was crucial for this growth
a second example: machine translation
- the task: automatic translation of text
  - given: text T in language A
  - result: text T′ in language B
- example: Google's translator, http://translate.google.de/
machine translation: a naive approach
- word mappings (sketched below)
  - hold a dictionary W : A → B
  - replace each w ∈ T by W(w)
  - problem 1: words don't match exactly between languages
  - problem 2: grammar
- learning grammar
  - problem 1: grammar is hard, especially with semantics mixed in (cf. Chomsky's hierarchy of grammars)
  - problem 2: language is noisy
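A minimal R sketch of the naive word-mapping approach, with an invented five-word dictionary; the broken output illustrates both problems at once:

```r
# Toy dictionary W: French -> English (entries invented for illustration).
W <- c(je = "I", ne = "", vous = "you", connais = "know", pas = "not")

translate_naive <- function(sentence) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  # words missing from W would come out as NA: problem 1 in action
  paste(unname(W[words]), collapse = " ")
}

translate_naive("Je ne vous connais pas")
# yields "I  you know not": word order and grammar are wrong,
# which is exactly the weakness of word-by-word replacement
```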
machine translation: a statistical approach
- learning from big data
  - new approach: don't understand or analyze the language
  - instead: translation by example
  - examples are taken from a corpus of manually translated documents
- basic idea (roughly)
  - learn the probability P that T′ is a translation of T
  - find the T′ with maximal P
- approach: breaking down probabilities
- note: the following explains the principle and is not correct in every detail
machine translation: breaking down probabilities
- example: translate a French text F to an English text E
  - P(E|F) - probability that E is a correct translation of F
  - let F = f_1 f_2 ... (f_i sentences; E analogous)
- first splitting assumption: each f_i corresponds to e_i
  - E is correct if each e_i translates its f_i:
    P(E|F) = ∏_i P(e_i|f_i)
  - try to maximize each P(e_i|f_i) (a tiny numeric sketch follows below)
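As a tiny numeric illustration of the splitting assumption (all probabilities invented):

```r
# Per-sentence translation probabilities P(e_i | f_i) for three
# hypothetical sentence pairs.
p_sentence <- c(0.8, 0.6, 0.9)

# Under the independence assumption, P(E|F) is their product.
p_EF <- prod(p_sentence)   # 0.432

# For long texts one sums log-probabilities instead, to avoid underflow.
sum(log(p_sentence))
```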
machine translation: breaking down probabilities
- consider a concrete pair of sentences:
  - Je ne vous connais pas. / I don't know you.
  - Je - I, vous - you, connais - know, ne ... pas - don't
- some observations
  - words are translated (Je → I)
  - some words change place (vous → you)
  - some words change number (e.g. ne ... pas → don't)
machine translation: breaking down probabilities
- formalize our observations into concrete probabilities:
  - translation P(f|e): f is a translation of e (Je → I)
  - distortion P(t|s, l): the word at position s ends up at position t in a sentence of length l (vous → you)
  - fertility P(n|e): e is replaced by n French words (e.g. don't → ne ... pas, n = 2)
machine translation: breaking down probabilities
- how does this help for P(E|F)?
  - recall the assumption P(E|F) = ∏_i P(e_i|f_i)
  - P(E|F) is high if every P(e_i|f_i) is high
  - the same principle can be applied on the sentence level
- breaking up sentences: P(e_i|f_i) has many parts
  - translation, distortion, fertility for every word, plus some more not covered here
  - combination by product (assuming independence)
  - the product P(e_i|f_i) is close to 1 only if all its parts are close to 1
  - use translation, distortion, fertility as indicators (see the sketch below)
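A numeric sketch of the combination by product for a single sentence pair; all indicator probabilities are invented:

```r
# Hypothetical indicator probabilities for one sentence pair.
p_translation <- c(0.9, 0.7, 0.8, 0.6)  # P(f|e), one per word pair
p_distortion  <- c(0.8, 0.5)            # P(t|s, l), for words that moved
p_fertility   <- c(0.9)                 # P(n|e), e.g. don't -> ne ... pas

# Combination by product, assuming independence of all parts.
score <- prod(p_translation, p_distortion, p_fertility)
score  # close to 1 only if every single factor is close to 1
```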
machine translation: missing data/open questions
- how are the partial probabilities determined?
  - estimation by observation; recall: translation by example
  - derive approximate probabilities by counting in the corpus (see the sketch below)
- what is left
  - basis: a large corpus of translated documents
  - additionally: a matching of sentences and words
- not considered here; further information: http://www.mt-archive.info/
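A sketch of the counting idea: given a (tiny, invented) word-aligned corpus, the translation probability P(f|e) can be estimated as a relative frequency:

```r
# Hypothetical word alignments extracted from a translated corpus:
# each row records an English word e aligned to a French word f.
alignments <- data.frame(
  e = c("I", "I", "know", "know", "know", "you"),
  f = c("je", "je", "connais", "sais", "connais", "vous"),
  stringsAsFactors = FALSE
)

# Estimate P(f|e) = count(e aligned to f) / count(e).
counts <- table(alignments$e, alignments$f)
p_f_given_e <- prop.table(counts, margin = 1)  # normalize each row

round(p_f_given_e["know", ], 2)  # e.g. connais: 0.67, sais: 0.33
```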
discussion
- why does this work?
  - it does not (translate a text into your native language on translate.google.com and you'll see)
  - still, the quality of the results is surprising
- open questions
  - does it scale?
  - why is it not always correct?
  - what would be the impact of adding more data?
  - can it be parallelized?