Big Data: How It Changes the Way You Treat Data
Oct. 2013
Chung-Min Chen, Chief Scientist, Info. Analysis Research & Services
The views and opinions expressed in this presentation are those of the author and do not necessarily reflect the position of the company.
2012 Applied Communication Sciences. All Rights Reserved.
About ACS
Company history
- Bellcore (Applied Research), 1985-1999
- Telcordia (Advanced Technology Solutions), 1999-2012
- Ericsson, 2012-2013
Big data R&D
- Stream: "Tribeca: A Stream Database Manager for Network Traffic Analysis," VLDB '96
- Latent semantic indexing
- Telecom: CDR/subscriber reconciliation, service assurance
Hope or Hype?
Hope or Hype?
Big data will change*
- The way you live
- The way you work
- The way you think
Or is big data a big bubble?
- Remember .com and Web 2.0?
- The hype cycle
[Figure: hype cycle curve, with axes labeled N and t]
* V. Mayer-Schönberger and K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think.
"Big data" on Google Trends
Has big data reached its hype peak?
[Figure: poll results; bar height is proportional to the number of votes]
Source: kdnuggets.com
The 4 V's of Big Data
"Big data is data whose scale, diversity, and/or timeliness requires new architectures and analytics to unlock business value." -- EMC²
Source: datasciencentral.com
Big Data Definition Revisited
"Data that is expensive to manage, and hard to extract value from" -- UCB AMP Lab
"Too big, expensive and too hard to handle!" -- MIT
Source: ORACLE
Big data is not about data size; it's about new ways of thinking about how to treat data.
Big Data Technologies
- OLAP, Mining, Learning, Visualization
- NoSQL, Parallel Programming, Distributed FS
- Analytics, Platform
- Value, Variety, Veracity, Volume, Velocity
Quantitative change leads to qualitative change
Passiveness leads to fidelity
- Past: volunteers + questionnaires (subject to the observer effect)
- Now: big data + analysis
Scrutiny leads to discovery
- Sampling shortfalls: true randomness is hard to achieve, samples lack detail, and targets can be missed
Machine Translation
Linguistic model
- Dictionary, grammar
- Rule-based
Statistical model
- Digests a bilingual text corpus
- Pattern match-based
How to improve accuracy
- Improve existing algorithms
- Develop new algorithms
- Increase training size (text corpus)
[Figure: accuracy vs. training size]
Machine Translation (cont.)
Examples of machine output
- 松下問童子 (intended: "Beneath the pines, I asked the boy"): "Panasonic asked the boy", "Panasonic asked the lad"
- 小心墜河 (intended: "Caution: risk of falling into the river"): "Carefully fall into the river", "Carefully zhuihe"
Elections
Obama big data team
- Targeted fund raising
- Social network-based canvassing and get-out-the-vote drives
- Targeted TV advertisement
Big data-based prediction
- Nate Silver vs. Washington elite
- Big data vs. phone polls
Sources
- "Inside the Secret World of Quants and Data Crunchers Who Helped Obama Win," TIME Magazine, Nov. 7, 2012.
- "How Vertica Was the Star of the Obama Campaign, and Other Revelations," www.citoresearch.com, Jan. 16, 2013.
Linguistics Research
500M tweets per day
Study of language evolution
Example findings
- Older users write :-), younger users write :) (Stanford Univ.)
- Young users favor expressive lengthening, e.g. "Coooool" (Univ. of Twente)
- Women like to use "I" and "!!!"; predict gender: 75% (Mitre)
Challenges
- Biased towards young, urban users
- Nonstandard speech, e.g. "Ima call #mybf now"
Source: "The Linguist's Mother Lode: What Twitter reveals about slang, gender and no-nose emoticons," TIME, Sep. 9, 2013.
2. Correlation Prevails over Causality
Knowing what is so without knowing why it is so
- Knowing correlation is good enough
- Predicting without explanation
Causality is hard, sometimes impossible, to verify
- Do high-voltage stations/towers cause cancer?
- Do base stations cause cancer?
- Does frequent mobile phone usage cause cancer?
Doctors vs. Computers: who do you trust?
ER crisis at Cook County Hospital, 1996
- Flooded with chest pain patients
- Who should be admitted (i.e., is having a real heart attack)?
Standard manual procedure
- BP, stethoscope, questions, ECG
- 90% of those admitted are false positives; recall is 83%
[Figure: overlap between patients admitted and patients having a heart attack]
Sources
- M. Gladwell, Blink: The Power of Thinking Without Thinking.
- Goldman L, Cook EF, Brand DA, et al. A computer protocol to predict myocardial infarction in emergency department patients with chest pain. N Engl J Med 1988; 318(13):797-803.
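The two figures quoted for the manual procedure measure different things: the false-positive share is taken over admitted patients, while recall is taken over patients who really had a heart attack. A minimal sketch of the arithmetic follows, assuming hypothetical patient counts chosen only to reproduce the percentages on the slide (they are not hospital data); the function name and dictionary keys are illustrative.

```python
def admission_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute the two rates quoted on the slide.

    tp: admitted and having a heart attack
    fp: admitted but not having a heart attack
    fn: having a heart attack but not admitted
    """
    admitted = tp + fp          # everyone the procedure admitted
    heart_attacks = tp + fn     # everyone who really had a heart attack
    return {
        "false_positive_share": fp / admitted,  # share of admissions that were unnecessary
        "recall": tp / heart_attacks,           # share of heart attacks that were caught
    }


# Hypothetical counts (assumption): 100 real heart attacks, 830 admissions.
manual = admission_metrics(tp=83, fp=747, fn=17)
print(manual)
```

With these made-up counts, 747 of 830 admissions are unnecessary while 17 of 100 real heart attacks are still sent home, which is how a procedure can over-admit and under-detect at the same time.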
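Doctors vs. Computers: who do you trust?
3-level decision tree
- (a) Unstable angina pain?
- (b) Fluid in lung?
- (c) Systolic BP < 100?
Results
- False positives < 30% (vs. > 90% by doctors)
- Recall > 95% (vs. 83% by doctors)
[Figure: binary decision tree with Yes/No branches, testing a at the root, b at the second level, and c at the third; overlap between patients admitted and patients having a heart attack]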
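A 3-level tree over questions (a), (b), and (c) can be sketched as below. The leaf outcomes are placeholders: this sketch admits a patient when at least two answers are yes, which is an assumption made purely for illustration and is not the rule of the published protocol. All names (`QUESTIONS`, `build_tree`, `classify`, `admit_at`) are hypothetical.

```python
# Questions asked at levels 1, 2, and 3 of the tree, in order.
QUESTIONS = ("unstable_angina", "fluid_in_lung", "systolic_bp_below_100")


def build_tree(depth: int = 0, yes_count: int = 0, admit_at: int = 2):
    """Build a full binary tree: one question per level, one leaf per path.

    Leaf labels are an ASSUMPTION (admit when >= admit_at answers are yes),
    not the outcomes of the real clinical protocol.
    """
    if depth == len(QUESTIONS):
        return "admit" if yes_count >= admit_at else "observe"
    return {
        "question": QUESTIONS[depth],
        "yes": build_tree(depth + 1, yes_count + 1, admit_at),
        "no": build_tree(depth + 1, yes_count, admit_at),
    }


def classify(tree, answers: dict) -> str:
    """Walk from the root to a leaf, following the yes/no answer at each node."""
    node = tree
    while isinstance(node, dict):
        node = node["yes"] if answers[node["question"]] else node["no"]
    return node


tree = build_tree()
patient = {
    "unstable_angina": True,
    "fluid_in_lung": False,
    "systolic_bp_below_100": True,
}
print(classify(tree, patient))
```

The point of the structure is that only three yes/no answers are consulted per patient, so the decision is fast and reproducible regardless of who applies it.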
Less is More: Feature Extraction
Other features seem to be insignificant
- Age
- Job: pressure, hours
- Exercise
- High BP history
- Weight
- Heart disease
- Sweating
2. Knowing What Is So Without Knowing Why (cont.)
Correlation prevails over causality
- Knowing correlation is good enough... well, not all the time
- Mechanical causality
- Bayesian network
- Data provenance: explain what I found
Data Provenance
Courtesy of Prof. Renee Miller, Univ. of Toronto
2. Knowing What Is So Without Knowing Why (cont.)
Correlation prevails over causality
- Knowing correlation is good enough... well, not all the time
Be careful not to ignore causality altogether
- Crowded parking lots -> higher sales
- Orange cars -> fewer defects
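To illustrate why a strong correlation by itself says nothing about cause, here is a minimal sketch that computes Pearson's r on made-up parking-lot and sales figures. The data and the names `pearson_r`, `cars_parked`, and `daily_sales` are hypothetical and are not taken from the talk.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical observations (assumption): cars counted in the lot, sales that day.
cars_parked = [10, 20, 30, 40, 50]
daily_sales = [12, 25, 31, 45, 52]

r = pearson_r(cars_parked, daily_sales)
print(f"r = {r:.3f}")
```

The coefficient is symmetric in its two arguments, so a value near 1 is equally compatible with "full lots drive sales" and "sales fill the lot"; it is useful for prediction, but acting on it (for example, by filling the lot artificially) requires a causal argument the number cannot supply.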
Issues
Privacy
- Notice and consent (Target)
- Opt out (Google)
- Anonymization (Netflix)
Societal impact
- Act before it happens
- Big data divide
Recap and Trends