Big Data and Scripting
Big Data and Scripting - abstract/organization
contents: introduction to Big Data and involved techniques
schedule:
  2 lectures (Mon 1:30 pm, M628, and Thu 10 am, F420)
  2 tutorials (Fri 10:00 am and 1:30 pm in F420); attend one tutorial
  written exams (July 30, October 14)
lecture: Uwe Nagel (uwe.nagel@uni-konstanz.de)
tutorials: Mark Ortmann
Big Data and Scripting - organizational stuff
communication:
  website: http://www.inf.uni-konstanz.de/algo/lehre/ss14/bds/
    assignment sheets, slides of the lecture, announcements
  register for this lecture in the LSF! (lsf.uni-konstanz.de)
assignments:
  mostly implementation tasks
  involve languages/algorithms covered in the lecture
  discussion and help in tutorials
  requirement for exams: 50% of assignment points
agenda - contents of this lecture
tools and techniques for (distributed) computation:
  Unix command line (i.e. bash scripting)
  Python (machine learning)
  NoSQL by example (databases)
  map/reduce (distributed computing)
algorithms for (distributed) computation:
  streaming algorithms
  memory hierarchies and distributed storage
  distributed and parallel algorithms
What this lecture does not cover
data mining:
  we use some data mining algorithms, but this is not a data mining course
  see the lecture Data Mining: Artificial Intelligence
recommender systems:
  mentioned, but no detailed coverage; again, not a primary topic of this lecture
Prologue
what does Big Data mean, and why is it interesting?
questions:
  Is Big Data just another buzzword?
  Is there an actual meaning or an advantageous technique behind it?
  If so, where does that come from, or what exactly is it?
remainder of this lecture: 3 example applications with increasing level of detail
Example 1: Amazon
a simple example - Amazon is basically a selling platform
provides:
  connection of suppliers to customers
  a common marketplace (one interface for all shops)
  additional services (storage, shipment, payment, search)
  recommendations
what is the difference to competitors?
  Amazon knows its customers, products, sales and views
  but the same is true for its competitors
Example 1: Amazon
in comparison, Amazon has many more customers
  more customers, more transactions, more views
  a larger data collection, hence better recommendations
  estimate [1]: 1/3 of Amazon's sales are generated by recommendations
more data = better predictions?
  simple answer: essentially yes
  real answer: it's a bit more complicated
[1] www.economist.com/blogs/graphicdetail/2013/02/elusive-big-data
Intermission
what are we trying to find out?
  learning/data mining and artificial intelligence are not that new
  somehow huge amounts of data can make a difference
  question: how and why?
approach: analyze examples using big data
  1. where is the big data?
  2. what kind of data is involved?
  3. what makes a large data base crucial?
Example 2: Target and the pregnant teen
Target:
  a large discounter chain (similar to Walmart/Aldi)
  uses data analysis for marketing
the story [2]:
  Target advertises baby equipment to a teenager
  her father complains, then finds out that his daughter is actually pregnant
  shows that Target predicts pregnancy better than family members
[2] www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-
Example 2: Target and the pregnant teen - How?
in a nutshell:
  collect data about customers (who buys what, when)
  predict what they are interested in
  adjust advertisement to the specific person
Example 2: Target and the pregnant teen - How?
1. step: data collection
  create a large base of data available about customers
  each customer gets some unique ID (credit card, email, ...)
  everything that can be connected to the customer is collected,
  linked to the customer ID and used for interest prediction
examples of data to collect:
  items purchased together
  time/place of purchase
  weather? - whatever can be collected
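to illustrate, a minimal Python sketch of such an event log (fields and values are hypothetical); the point is that every event carries the customer ID, so all records can later be grouped per customer:

    from collections import defaultdict

    # hypothetical purchase events, each tied to a customer ID
    events = [
        {"customer_id": "c1", "item": "diapers", "time": "2012-02-10 14:03"},
        {"customer_id": "c2", "item": "chips",   "time": "2012-02-10 18:47"},
        {"customer_id": "c1", "item": "wipes",   "time": "2012-02-11 09:12"},
    ]

    # group everything by customer ID - the basis for interest prediction
    by_customer = defaultdict(list)
    for event in events:
        by_customer[event["customer_id"]].append(event)

    print(by_customer["c1"])  # both purchases of customer c1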
Example 2: Target and the pregnant teen - How?
next: search for patterns
  simple: people buy what they always bought
  recommendation: customers who bought this usually also buy ... (see the sketch below)
concrete targeting, example: young parents
  a new child is a perfect opportunity: parents have to buy a lot of stuff (without having too much money)
  at this stage they are more easily bound to brands
  prediction of pregnancy is crucial for advertisement
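the "customers who bought this also buy" idea can be sketched in a few lines of Python; the baskets here are made up, and real systems use far more refined statistics:

    from collections import Counter, defaultdict

    # hypothetical purchase baskets, one per transaction
    baskets = [
        {"diapers", "beer", "chips"},
        {"diapers", "baby lotion", "wipes"},
        {"beer", "chips"},
        {"diapers", "wipes"},
    ]

    # count how often each pair of items appears in the same basket
    co_counts = defaultdict(Counter)
    for basket in baskets:
        for item in basket:
            for other in basket - {item}:
                co_counts[item][other] += 1

    def recommend(item, k=2):
        """Items most frequently bought together with `item`."""
        return [other for other, _ in co_counts[item].most_common(k)]

    print(recommend("diapers"))  # e.g. ['wipes', 'baby lotion']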
Example 2: Target and the pregnant teen - How?
data gathering:
  customers are described by their purchases (items, time, payment method, ...)
  products can also be described by purchases (e.g. family products are mostly bought on weekends)
given enough records, patterns emerge:
  typical purchase histories (as in the pregnancy example)
  typical customers (as in "I always buy beer and chips")
  new products that become popular vs. products that are ignored
these patterns can be very complicated
more data leads to more opportunities (e.g. more complicated patterns become detectable)
Example 2: Target and the pregnant teen - results
patterns in the example:
  quoting a Target analyst: they identified 25 products that, when analyzed together, allow a pregnancy prediction score (see the sketch below)
  example: pregnant women buy supplements like calcium, magnesium and zinc sometime in the first 20 weeks
business impact:
  start of the program: 2002
  revenue growth: $44 billion (2002) to $67 billion (2010)
  it is assumed that data mining was crucial for this growth
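Target's actual 25 products and weights are not public; the following Python toy only illustrates the idea of a weighted score over indicator products (products and weights are made up):

    # hypothetical indicator products and weights
    indicators = {"calcium supplement": 0.3, "unscented lotion": 0.4,
                  "zinc supplement": 0.2, "large tote bag": 0.1}

    def pregnancy_score(purchases):
        """Toy score: sum of weights of indicator products bought."""
        return sum(indicators.get(item, 0.0) for item in purchases)

    print(pregnancy_score(["unscented lotion", "calcium supplement", "beer"]))  # ~0.7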
Example 3: machine translation
the task: automatic translation of text
  given: text T in language A
  result: text T' in language B
example: Google's translator
  URL: http://translate.google.de/
Example 3: machine translation - a naive approach
word mappings:
  hold a dictionary W : A → B
  replace each w ∈ T by W(w)
1. problem: words don't match exactly between languages, neither in meaning nor in number
2. problem: grammar
  grammar is hard:
    language is noisy, grammar is often not exact
    even exact grammar is hard (semantics, context)
    see the Chomsky hierarchy, theory of computer science
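a minimal Python sketch of the naive approach (the dictionary is made up); the output shows both problems at once: word order is kept, and the split negation "ne ... pas" has no one-to-one counterpart:

    # tiny hypothetical French -> English word dictionary
    W = {"je": "I", "vous": "you", "connais": "know", "ne": "?", "pas": "?"}

    def naive_translate(text):
        """Replace each word by its dictionary entry, keeping word order."""
        return " ".join(W.get(word, word) for word in text.lower().split())

    print(naive_translate("Je ne vous connais pas"))  # "I ? you know ?"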
Example 3: machine translation - a statistical approach
new approach: translation by example
  model with very few assumptions and simple rules
  rules expressed by probabilities
  examples are taken from a corpus of manually translated documents
basic idea:
  model training: learn the probability P that some text T' is a translation of text T
  translation: find the T' with maximal P
note: the following explains the principle and is not correct in every detail
Example 3: machine translation - breaking down probabilities
example: translate French text F to English text E
  P(E|F) - probability that E is the correct translation of F
  let F = f_1 f_2 ... (f_i a sentence, E analogous)
first splitting assumption: f_i corresponds to e_i
  E is correct if each e_i translates f_i, independent of the other sentences:
    P(E|F) = ∏_i P(e_i|f_i)
  try to maximize P(e_i|f_i) for all i independently,
  i.e. find the most probable translation sentence by sentence (see the sketch below)
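under this assumption the maximization decomposes sentence by sentence; a small Python sketch with hypothetical sentence-level probabilities:

    import math

    # hypothetical per-sentence translation probabilities P(e|f)
    P = {
        "je ne vous connais pas": {"I don't know you": 0.6, "I know you not": 0.3},
        "bonjour": {"hello": 0.8, "good day": 0.2},
    }

    def best_translation(sentences):
        """Pick the most probable e_i for each f_i independently;
        the product of the picks is P(E|F) under the assumption."""
        picks = [max(P[f].items(), key=lambda kv: kv[1]) for f in sentences]
        prob = math.prod(p for _, p in picks)
        return [e for e, _ in picks], prob

    E, p = best_translation(["bonjour", "je ne vous connais pas"])
    print(E, p)  # ['hello', "I don't know you"] ~0.48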
Example 3: machine translation - breaking down probabilities
consider a concrete pair of sentences:
  Je ne vous connais pas. - I don't know you.
  Je - I, vous - you, connais - know, ne ... pas - don't
some observations:
  words are translated (Je → I)
  some words change place (vous → you)
  some words change number (ne ... pas → don't)
Example 3: machine translation - breaking down probabilities
Je ne vous connais pas. - I don't know you.
formalize our observations into concrete probabilities:
  translation P(f|e): f is a translation of e (Je → I)
  distortion P(t|s, l): the word at position s produces the word at position t in a sentence of length l (vous → you)
  fertility P(n|e): e is replaced by n French words (don't → ne ... pas, n = 2)
Example 3: machine translation - breaking down probabilities
how does this help for P(E|F)?
  recall the assumption P(E|F) = ∏_i P(e_i|f_i)
  P(E|F) is high if every P(e_i|f_i) is high
  the same principle can be applied on the sentence level
breaking up sentences:
  P(e_i|f_i) has many parts: translation, distortion, fertility for every word, and some more, unknown ones
  combination by product (assuming independence): P(e_i|f_i) ≈ 1 if all the parts are ≈ 1
  use translation, distortion, fertility as indicators
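a Python sketch of the combination by product, with made-up translation and distortion tables (fertility is omitted for brevity; real systems learn millions of such parameters):

    from math import prod

    # hypothetical word-level tables
    translation = {("je", "I"): 0.9, ("vous", "you"): 0.8, ("connais", "know"): 0.7}
    distortion = {(1, 1, 3): 0.9, (3, 2, 3): 0.4, (2, 3, 3): 0.5}  # P(t | s, l)

    def sentence_score(alignment, l):
        """Product of translation and distortion probabilities for an
        alignment given as tuples (f, e, s, t): French word f at source
        position s produces English word e at target position t."""
        return prod(translation[(f, e)] * distortion[(t, s, l)]
                    for f, e, s, t in alignment)

    # "Je vous connais" -> "I know you": vous moves from position 2 to 3
    align = [("je", "I", 1, 1), ("vous", "you", 2, 3), ("connais", "know", 3, 2)]
    print(sentence_score(align, 3))  # ~0.091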
Example 3: machine translation - wrap up
assumption: translation can be broken down into simple probabilities
learning: estimate the individual probabilities
translation: find the most probable sentences
why is this a Big Data application?
  the individual probabilities are learned from real, manually translated texts
  this is an individual problem for each pair of languages
  many individual probabilities have to be determined
  they can be estimated from example texts (more texts, better estimates)
  quality grows with the additional knowledge (texts) fed into the system
Example 3: machine translation - missing data/open questions
data sources:
  many books are translated into various languages
  European laws are translated into all European languages
  derive approximate probabilities by counting in the corpus (see the sketch below)
open:
  matching of sentences and words
  finding the actual translation (we only have partial probabilities)
further information: http://www.mt-archive.info/
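a minimal Python sketch of deriving translation probabilities by counting, assuming word-aligned sentence pairs are already given (in practice this alignment itself has to be learned, and the corpus here is made up):

    from collections import Counter

    # hypothetical word-aligned corpus: pairs (french_word, english_word)
    aligned_pairs = [("je", "I"), ("je", "I"), ("vous", "you"),
                     ("vous", "you"), ("tu", "you"), ("connais", "know")]

    # relative frequencies as estimates of the translation probability P(f|e)
    pair_counts = Counter(aligned_pairs)
    e_counts = Counter(e for _, e in aligned_pairs)

    def p_translation(f, e):
        """Estimate P(f|e) = count(f, e) / count(e)."""
        return pair_counts[(f, e)] / e_counts[e] if e_counts[e] else 0.0

    print(p_translation("je", "I"))      # 1.0
    print(p_translation("vous", "you"))  # ~0.67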
discussion
why does this work?
  it works only partially; test for yourself: translate.google.com
  still, the quality of the results is surprising
does it scale?
why is it not always correct?
what would be the impact of adding more data?
can it be parallelized?
the notion Big Data
no agreement on a definition
usual understanding: methods that
  involve machine learning/data mining
  necessarily involve massive amounts of data
  improve with additional input
the remainder of this lecture
some applications necessarily involve massive amounts of data
this lecture is about handling such data:
  we are (almost) not interested in the actual application
  we are interested in the means to enable these applications
two main points of interest:
  practical: scripting, command lines, programming
  theoretical: algorithms for data storage and handling, basic algorithms for large data sets (e.g. sorting)
next lecture: basics on the command line