Big Data
"Big data will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus." - McKinsey
"Data Scientist: The Sexiest Job of the 21st Century" - Davenport and Patil, Harvard Business Review, 2012
Why is data science booming as a field?
Lyle Ungar, University of Pennsylvania
Everybody is selling big data
What is Big Data?
"High volume, velocity and/or variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." (Gartner)
1 Terabyte = 1024 Gigabytes
1 Petabyte = 1024 Terabytes
1 Exabyte = 1024 Petabytes
1 Zettabyte = 1024 Exabytes
Who does big data?
IBM, HP, EMC, Teradata, Oracle, SAP, Microsoft, AWS, VMware, Google
Splunk, MemSQL, Palantir, Trifacta, Datameer, Neo, DataStax, Infobright, Fractal Analytics
http://www.datamation.com/applications/30-big-data-companies-leading-the-way-1.html
Big Data
- Big n vs. big p
- How is big data different?
  - Use available large-scale data rather than hoping for annotated data
- Semi-parametric or non-parametric methods
Different methods work best at scale
- Confusion set disambiguation
  - Choose the correct word in the set, given the context:
    {principle, principal} {then, than} {to, two, too} {weather, whether}
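A toy sketch of confusion-set disambiguation: score each candidate by how often it co-occurs with its neighboring words, and pick the highest-scoring one. The bigram counts below are made up for illustration; at scale they would come from a large n-gram corpus.

```python
# Made-up bigram counts standing in for counts from a huge corpus.
NGRAM_COUNTS = {
    ("the", "principal"): 900, ("the", "principle"): 400,
    ("principal", "of"): 700,  ("principle", "of"): 800,
}

def disambiguate(confusion_set, left, right):
    """Pick the candidate whose bigrams with the context are most frequent."""
    def score(word):
        return (NGRAM_COUNTS.get((left, word), 0)
                + NGRAM_COUNTS.get((word, right), 0))
    return max(confusion_set, key=score)

# "the ___ of the school": counts favor "principal"
print(disambiguate({"principle", "principal"}, "the", "of"))  # -> principal
```

With web-scale counts, even this simple counting approach becomes competitive, which is the point of the next slide.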
The unreasonable effectiveness of data
How to handle big data?
- Dimensionality reduction
- Sampling
- Hadoop/MapReduce
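As one concrete instance of sampling when the data do not fit in memory, here is a sketch of reservoir sampling (Vitter's Algorithm R), which keeps a uniform sample of k items from a stream of unknown length in a single pass:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """One-pass uniform sample of k items from a stream too large to store."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)            # fill the reservoir first
        else:
            j = rng.randrange(i + 1)       # item i kept with prob k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(10**6), 5))
```

Each item survives with probability k/n, so downstream analysis on the sample is unbiased.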
Big Data: different philosophy
Different data handling:
- Mostly unstructured data objects (schema-less, NoSQL)
- Vast number of attributes and data sources
- Data sources added and/or updated frequently
- Quality is unknown
Different programming philosophy:
- Distributed programming
- Fault tolerant
External references:
http://developer.yahoo.com/hadoop/
http://code.google.com/edu/parallel/mapreduce-tutorial.html
What is the slowest part of big data analysis?
A) Multiplying XᵀX?
B) Inverting the matrix (XᵀX)⁻¹?
C) Reading X from disk into memory?
D) Other?
Data Parallelism
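Data parallelism means running the same model on different shards of the data and combining the partial results. A minimal single-process sketch, using gradient descent for least squares (the shard loop stands in for workers that would run on separate machines):

```python
# Each "worker" computes a partial gradient on its own shard of the data;
# the driver sums the partials and updates the shared model.

def partial_gradient(w, shard):
    """Gradient of squared error on one shard, for the model y = w*x."""
    return sum(2 * (w * x - y) * x for x, y in shard)

data = [(x, 3.0 * x) for x in range(100)]      # synthetic data, true slope 3
shards = [data[i::4] for i in range(4)]        # 4 workers

w = 0.0
for step in range(200):
    grad = sum(partial_gradient(w, s) for s in shards)  # combine step
    w -= 1e-5 * grad / len(data)
print(round(w, 2))  # -> 3.0
```

The combine step is just a sum, which is why gradient-based learners shard so naturally across machines.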
Model Parallelism
Map-Reduce Dataflow
Data is divided across multiple machines ("mappers").
Each mapper does the same thing to different data.
Results are combined ("reduced").
http://codecapsule.com/2010/04/15/efficient-countingmapreduce/
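The dataflow above can be simulated in one process with word count, the standard MapReduce example: mappers emit (word, 1) pairs, a shuffle groups them by key, and reducers sum each group.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group mapper output by key, as the framework would.
groups = defaultdict(list)
for word, count in chain.from_iterable(mapper(line) for line in lines):
    groups[word].append(count)

result = dict(reducer(w, c) for w, c in groups.items())
print(result["the"])  # -> 3
```

In a real cluster, the mapper calls run on different machines and the shuffle moves data over the network; the user only writes `mapper` and `reducer`.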
How easy is it to do in map-reduce?
- Linear regression
- Linear regression with feature selection
- SVM
- k-NN
- k-means / EM
A) Easy  B) Hard  C) Impossible
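Linear regression lands on the "easy" side because its closed-form solution needs only sufficient statistics that are sums over rows, so mappers can emit partial sums for their shards and one reduce combines them. A single-feature sketch (the multi-feature case sums XᵀX and Xᵀy the same way):

```python
# Fit y = w*x by least squares; only sum(x*y) and sum(x*x) are needed,
# and both are sums over rows, so they shard across mappers.
data = [(x, 2.5 * x) for x in range(1, 101)]   # synthetic data, slope 2.5
shards = [data[i::4] for i in range(4)]        # 4 mappers

def map_sums(shard):
    """Mapper: partial sufficient statistics for one shard."""
    return (sum(x * y for x, y in shard), sum(x * x for x, y in shard))

parts = [map_sums(s) for s in shards]          # map phase
sxy = sum(p[0] for p in parts)                 # reduce phase
sxx = sum(p[1] for p in parts)
w = sxy / sxx
print(w)  # -> 2.5
```

By contrast, k-NN is awkward because every query must be compared against all shards, and iterative methods like k-means/EM need one MapReduce pass per iteration.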
Good tools
- LDA
  - Mallet
  - Factorie
- Deep nets
  - Theano
  - Caffe, Torch
  - TensorFlow?
Hadoop
Slide from Cloudera
In Hadoop
- Hive: data warehouse infrastructure providing data summarization, query, and analysis
- Pig, Crunch: high-level platforms for creating MapReduce programs used with Hadoop; pipelining code
- Mahout: scalable machine learning and data mining
- Solr: enterprise search platform built on Apache Lucene
- Hue: visualization
Spark
- Combines SQL, streaming, and complex analytics
- Often runs on Hadoop
  - or Mesos, standalone, or in the cloud
- Bindings to
  - Java, Scala, Python, R
  - NLTK
- MLlib
  - Machine Learning Library
  - Faster than Mahout
Seems to be replacing Hadoop
Increasingly use a deep stack
Increasingly in the cloud
- X as a Service
  - SaaS (software)
  - PaaS (platform)
  - IaaS (infrastructure)
- It's easy to spin these up on AWS
  - or MS Azure
Tools are changing rapidly
- Currently hot:
  - SMACK: Spark, Mesos, Akka, Cassandra, and Kafka
  - Deep net package of the week
But the fundamentals we learned in this class are not changing!