Big Data and Scripting (lecture, computer science, bachelor/master/PhD)
Big Data and Scripting - abstract/organization
- abstract: introduction to Big Data and the techniques involved
- lecture (2+2) with practical exercises to be turned in
- dates:
  - 2 lectures (Mon 1:30 pm, M628 and Thu 10 am, G302)
  - 2 lab courses (Fri 10:00 am and 1:30 pm in Z613)
- oral exam at the end of the semester
- lecturer: Uwe Nagel, uwe.nagel@uni-konstanz.de
Big Data and Scripting - organizational matters
- exercises website: http://www.inf.uni-konstanz.de/algo/lehre/ss13/bds/
- (about) 3 projects (bash, R, NoSQL/Hadoop)
- programming skills useful, but not required
- discussion and help in the lab course (Friday)
agenda - contents of this lecture
- prologue: What is Big Data and why bother?
  - concrete examples
  - identify qualitatively what sets Big Data approaches apart
- tools and techniques for (distributed) computation
  - (some) basic notions of data handling
  - Unix command line
  - scripting in R
  - NoSQL by example
  - the map/reduce paradigm (example: Hadoop)
What this lecture does not cover
- basics of data mining
  - we use some data-mining techniques, but this is not a data mining course
  - see the lecture Data Mining: Artificial Intelligence
- recommender systems
  - we will touch on those, but without detail
  - see the seminar/lecture Recommender Systems
Prologue
- what does Big Data mean, and why is it interesting?
- Big Data and distributed computing seem like a fashion
  - is there really an advantage?
  - where does this advantage come from?
- 3 example applications, with increasing level of detail
What is Big Data and why bother? - a simple example: Amazon
- basically a selling platform; provides:
  - connection of suppliers to (private) customers
  - a common marketplace (one interface for all)
  - additional services (storage, shipment, payment)
  - recommendations
- what is the difference to competitors?
  - Amazon knows customers, products, sales and views
  - but the same is true for its competitors
What is Big Data and why bother?
- in comparison, Amazon has many more customers
  - more customers, more transactions, more views
  - a larger data collection
  - better recommendations
- estimate: about 1/3 of Amazon's sales are generated by recommendations [1]
- more data = better predictions?
  - simple answer: essentially yes
  - real answer: it's a bit more complicated
[1] www.economist.com/blogs/graphicdetail/2013/02/elusive-big-data
What is Big Data? - extraction from examples
- what are we trying to find out?
  - learning/data mining and artificial intelligence are not that new
  - yet somehow huge amounts of data can make a difference
  - question: how and why?
- approach: analyze examples using big data
  1. where is the big data?
  2. what kind of data is involved?
  3. what makes a large database crucial?
Target and the pregnant teen
- Target
  - a large discounter chain (similar to Walmart)
  - uses data analysis for targeted marketing
  - central to one of the most famous big data stories
- the story: Target predicts pregnancy better than family members
- source: www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-fathe
Target and the pregnant teen - How?
- in a nutshell:
  1. collect data about customers
  2. predict what they are interested in
  3. adjust advertisement to the specific person
Target and the pregnant teen - How?
- step 1: data collection
  - create a large base of data available about customers
  - each customer gets a unique ID (credit card, email, ...)
  - everything that can be connected to the customer is collected
    - connected to the customer ID
    - used for interest prediction
- examples of data to collect (a toy record layout is sketched below):
  - items purchased together
  - time/place of purchase
  - weather? - whatever can be collected
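A minimal sketch of what such a purchase log could look like in R (one of the course's scripting languages); the field names and values are invented for illustration and are not Target's actual schema:

```r
# Hypothetical purchase log: every event is tied to a customer ID.
# Fields and values are illustrative only.
purchases <- data.frame(
  customer_id = c("C1001", "C1001", "C1001", "C2002", "C2002"),
  item        = c("prenatal vitamins", "unscented lotion", "cotton balls",
                  "unscented lotion", "beer"),
  timestamp   = as.POSIXct(c("2012-02-01 10:15:00", "2012-02-03 18:40:00",
                             "2012-02-03 18:40:00", "2012-02-05 12:00:00",
                             "2012-02-05 12:00:00")),
  store       = c("S17", "S17", "S17", "S03", "S03"),
  stringsAsFactors = FALSE
)

# everything known about one customer, linked via the ID
subset(purchases, customer_id == "C1001")
```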
Target and the pregnant teen - How?
- next: search for patterns
  - simple: people buy what they always bought
  - recommendation: customers who bought this usually also buy ... (a counting sketch follows below)
- concrete targeting, example: young parents
  - a new child is a perfect opportunity:
    - parents have to buy a lot of stuff (without having too much money)
    - at this stage they are more likely to become bound to brands
  - prediction of pregnancy is therefore crucial for advertisement
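A toy version of "customers who bought this usually also buy ..." as pure counting; the baskets below are invented:

```r
# Hypothetical per-customer purchase histories.
baskets <- list(
  C1001 = c("prenatal vitamins", "unscented lotion", "cotton balls"),
  C2002 = c("unscented lotion", "beer")
)

# Count how often two items occur in the same customer's history.
pairs <- lapply(baskets, function(items) {
  items <- sort(unique(items))
  if (length(items) < 2) return(NULL)
  combn(items, 2, paste, collapse = " & ")  # all item pairs per customer
})
co_buy <- table(unlist(pairs))
sort(co_buy, decreasing = TRUE)  # most frequent pairs drive recommendations
```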
Target and the pregnant teen - How?
- remark: this is how one could do it, not necessarily how it was done
- ground truth?
  - customers are described by their purchases
  - goal: identify patterns typical for pregnant women
- first steps: identify purchase records
  - of pregnant women (i.e. positive label, group P)
  - of non-pregnant customers (i.e. negative label, group N)
- searching for hints
  - find commonalities within P
  - find features distinguishing P from N
  - build a predictor for P(c ∈ P) (it is unknown how exactly Target does this; one possible approach is sketched below)
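Since Target's actual method is unknown, here is one standard way such a predictor could be built: logistic regression on purchase-based features. The features and labels below are invented for illustration.

```r
# Toy training set: rows are customers, columns are purchase features,
# label 'pregnant' encodes group P (1) vs. group N (0). All values invented.
train <- data.frame(
  bought_supplements = c(1, 1, 0, 1, 1, 0),
  bought_unscented   = c(1, 0, 0, 0, 1, 1),
  pregnant           = c(1, 1, 0, 0, 1, 0)
)

# Logistic regression: models P(c in P) from the features.
model <- glm(pregnant ~ bought_supplements + bought_unscented,
             data = train, family = binomial)

# Score a new customer: the estimated P(c in P).
predict(model,
        newdata = data.frame(bought_supplements = 1, bought_unscented = 1),
        type = "response")
```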
Target and the pregnant teen - results
- identified patterns, quoting a Target analyst:
  - they identified 25 products that, when analyzed together, allow a pregnancy prediction score P(c ∈ P)
  - example: pregnant women buy supplements like calcium, magnesium and zinc sometime in the first 20 weeks
- business impact
  - start of program: 2002
  - revenue growth: $44 billion (2002) → $67 billion (2010)
  - it is assumed that data mining was crucial for this growth
a second example: machine translation
- the task: automatic translation of text
  - given: text T in language A
  - result: text T′ in language B
- example: Google's translator, http://translate.google.de/
machine translation: a naive approach
- word mappings (sketched below)
  - hold a dictionary W : A → B
  - replace each w ∈ T by W(w)
  - problem 1: words don't match exactly between languages
  - problem 2: grammar
- learning grammar
  - problem 1: grammar is hard, especially with semantics mixed in (cf. Chomsky's hierarchy of grammars)
  - problem 2: language is noisy
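A minimal R sketch of the naive word-mapping approach, with an invented five-word dictionary; the broken output illustrates both problems at once:

```r
# Toy dictionary W: French -> English (entries invented for illustration).
W <- c(je = "I", ne = "", vous = "you", connais = "know", pas = "not")

translate_naive <- function(sentence) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  # words missing from W would come out as NA: problem 1 in action
  paste(unname(W[words]), collapse = " ")
}

translate_naive("Je ne vous connais pas")
# yields "I  you know not": word order and grammar are wrong,
# which is exactly the weakness of word-by-word replacement
```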
machine translation: a statistical approach
- learning from big data
  - new approach: don't understand or analyze the language
  - instead: translation by example
  - examples are taken from a corpus of manually translated documents
- basic idea (roughly)
  - learn the probability P that T′ is a translation of T
  - find the T′ with maximal P
- approach: breaking down probabilities
- note: the following explains the principle and is not correct in every detail
machine translation: breaking down probabilities
- example: translate a French text F to an English text E
  - P(E|F) - probability that E is a correct translation of F
  - let F = f_1 f_2 ... (f_i sentences; E analogous)
- first splitting assumption: each f_i corresponds to e_i
  - E is correct if each e_i translates its f_i:
    P(E|F) = ∏_i P(e_i|f_i)
  - try to maximize each P(e_i|f_i) (a tiny numeric sketch follows below)
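As a tiny numeric illustration of the splitting assumption (all probabilities invented):

```r
# Per-sentence translation probabilities P(e_i | f_i) for three
# hypothetical sentence pairs.
p_sentence <- c(0.8, 0.6, 0.9)

# Under the independence assumption, P(E|F) is their product.
p_EF <- prod(p_sentence)   # 0.432

# For long texts one sums log-probabilities instead, to avoid underflow.
sum(log(p_sentence))
```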
machine translation: breaking down probabilities
- consider a concrete pair of sentences:
  - Je ne vous connais pas. / I don't know you.
  - Je - I, vous - you, connais - know, ne ... pas - don't
- some observations
  - words are translated (Je → I)
  - some words change place (vous → you)
  - some words change number (e.g. ne ... pas → don't)
machine translation: breaking down probabilities
- formalize our observations into concrete probabilities:
  - translation P(f|e): f is a translation of e (Je → I)
  - distortion P(t|s, l): the word at position s ends up at position t in a sentence of length l (vous → you)
  - fertility P(n|e): e is replaced by n French words (e.g. don't → ne ... pas, n = 2)
machine translation: breaking down probabilities
- how does this help for P(E|F)?
  - recall the assumption P(E|F) = ∏_i P(e_i|f_i)
  - P(E|F) is high if every P(e_i|f_i) is high
  - the same principle can be applied on the sentence level
- breaking up sentences: P(e_i|f_i) has many parts
  - translation, distortion, fertility for every word, plus some more not covered here
  - combination by product (assuming independence)
  - the product P(e_i|f_i) is close to 1 only if all its parts are close to 1
  - use translation, distortion, fertility as indicators (see the sketch below)
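A numeric sketch of the combination by product for a single sentence pair; all indicator probabilities are invented:

```r
# Hypothetical indicator probabilities for one sentence pair.
p_translation <- c(0.9, 0.7, 0.8, 0.6)  # P(f|e), one per word pair
p_distortion  <- c(0.8, 0.5)            # P(t|s, l), for words that moved
p_fertility   <- c(0.9)                 # P(n|e), e.g. don't -> ne ... pas

# Combination by product, assuming independence of all parts.
score <- prod(p_translation, p_distortion, p_fertility)
score  # close to 1 only if every single factor is close to 1
```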
machine translation: missing data/open questions
- how are the partial probabilities determined?
  - estimation by observation; recall: translation by example
  - derive approximate probabilities by counting in the corpus (see the sketch below)
- what is left
  - basis: a large corpus of translated documents
  - additionally: a matching of sentences and words
- not considered here; further information: http://www.mt-archive.info/
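A sketch of the counting idea: given a (tiny, invented) word-aligned corpus, the translation probability P(f|e) can be estimated as a relative frequency:

```r
# Hypothetical word alignments extracted from a translated corpus:
# each row records an English word e aligned to a French word f.
alignments <- data.frame(
  e = c("I", "I", "know", "know", "know", "you"),
  f = c("je", "je", "connais", "sais", "connais", "vous"),
  stringsAsFactors = FALSE
)

# Estimate P(f|e) = count(e aligned to f) / count(e).
counts <- table(alignments$e, alignments$f)
p_f_given_e <- prop.table(counts, margin = 1)  # normalize each row

round(p_f_given_e["know", ], 2)  # e.g. connais: 0.67, sais: 0.33
```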
discussion
- why does this work?
  - it does not (translate a text into your native language on translate.google.com and you'll see)
  - still, the quality of the results is surprising
- open questions
  - does it scale?
  - why is it not always correct?
  - what would be the impact of adding more data?
  - can it be parallelized?