The? Data: Introduction and Future
Husnu Sensoy
Global Maksimum Data & Information Technologies

Global Maksimum Data & Information Technologies: The Data Company
Massive Data, Unstructured Data, Insight, Information, Action, Dark Data
Concepts Guide BIG DATA
Big Data: Problem with Terminology
Our industry is known to be one of the worst at naming things.
- NoSQL → Not only SQL
- Grid Computing → Cluster Computing → Distributed Computing
- Massive Data is already named. Others are on the way.
- Volume, Variety, Velocity, Veracity, Value: how many of them do you really need?
Big Data: DWH/BI vs. Big Data

                     Big Data            Conventional Analytics
Data Type            Unstructured        Column-Row Major Format
Volume               100 TB to 1 PB      Less than 100 TB
Data Delivery        Fast & Nonstop      Static, batch, or mini-batch
The way to analyze   Machine Learning    Hypothesis Based
Primary Purpose      Data Products       Decision Support & Related Services

Global Maksimum Comment: Data Type: Not fully agreed. Other rows: Not fully agreed, Almost agreed, Almost agreed.
Big Data: How to think about it
Big Data refers to things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value.
Big Data: We have invented nothing
Lascaux Caves, 17,000 years
Big Data: Not about the amount of data you have
It does not matter how high-quality your horse pictures are or how many of them you can draw in 17,000 years. What matters is drawing 25 horse pictures per second and getting a movie of a running horse out of them.
Peter Norvig (Director of Research at Google Inc.)
Big Data: Curse of Sampling
- What is the probability that the sun will rise tomorrow?
- Problem with micro-segmentation?
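The sunrise question is classically answered with Laplace's rule of succession, which estimates the probability of another success after s successes in n trials as (s + 1) / (n + 2). Assuming that is the reference intended here, a minimal Python sketch follows; the function name and the sample counts are illustrative and not taken from the talk.

```python
from fractions import Fraction

def rule_of_succession(successes, trials):
    """Laplace's estimate of P(next trial succeeds) after `successes` in `trials`."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)

# Illustrative counts only: the same 90% observed rate in a large sample
# and in a tiny micro-segment, plus the case with no observations at all.
large_sample = rule_of_succession(9_000, 10_000)
micro_segment = rule_of_succession(9, 10)
no_data = rule_of_succession(0, 0)

print(float(large_sample), float(micro_segment), float(no_data))
```

With the same observed rate, the estimate from 10 observations is pulled much further toward 1/2 than the estimate from 10,000. That is one way to read the micro-segmentation concern: the smaller the segment, the less the sample can be trusted.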
Skills BIG DATA
Data Scientist: (Not) A New Job Title
- A statistician with better programming skills than other statisticians.
- A programmer (hacker) with a better understanding of statistics than other hackers.
Data Scientist: Open Education Initiative
- Previously non-trivial
- Now really easy
Tools BIG DATA
Hadoop: Components/Projects in Essence
- Hadoop Distributed File System (HDFS): more stable, better performing, and supported alternatives are available (Red Hat GlusterFS)
- Hadoop MapReduce: a programming model that is not for productivity; a renamed version of the old scatter-gather paradigm; starting to be replaced by Spark
- Pig, Hive, Spark, Mahout
- Cassandra: distributed wide-columnar storage, well supported by DataStax
MapReduce: Yet another over-marketing
- Too low-level
  - Programming models/languages are for humans, not machines.
  - From many perspectives C++ is a better language than Java. But ...
  - Not really possible to use without higher-level abstractions like Pig, Hive or ...
  - The only simple example available in all slides is Word Frequency Counting
- Lack of novelty
  - All well-known MPP databases have been using it for decades
- Limited interactivity
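Word frequency counting, the example mentioned above, can be sketched in plain Python to show the map, shuffle, and reduce steps. This is an in-memory illustration of the programming model, not Hadoop's API; the function names and sample documents are made up for this sketch.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped values of each key."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(documents):
    # In a real cluster each document (or split) would be mapped on a
    # different node; here everything runs in a single process.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))

counts = word_count(["big data is big", "data is data"])
print(counts)
```

Even this toy version needs three functions for what a single GROUP BY expresses in SQL, which is the productivity point the slide makes.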
Hadoop: How do large-scale users use it?
- Unstructured to structured conversion
- ETL
- Data acquisition and pipelining
- Long-term data retention
Relational Databases: We still need them
- New trends overload Hadoop
- SQL (mathematically sound, set-based processing) is still the best way to do many data-related operations.
- SQL is also extensible and easy to scale through UDFs (user-defined functions).
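As a small illustration of extending SQL with a UDF, the sketch below registers a Python function with SQLite through the standard-library sqlite3 module and calls it from a set-based query. SQLite stands in for any SQL engine here; the table, the labeling function, and its thresholds are hypothetical and chosen only for the example.

```python
import sqlite3

def usage_label(megabytes):
    """Hypothetical UDF: bucket a daily usage figure into a coarse label."""
    if megabytes >= 1000:
        return "heavy"
    if megabytes >= 100:
        return "medium"
    return "light"

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it by name.
conn.create_function("usage_label", 1, usage_label)

conn.execute("CREATE TABLE mobile_usage (customer TEXT, megabytes REAL)")
conn.executemany(
    "INSERT INTO mobile_usage VALUES (?, ?)",
    [("a", 1500.0), ("b", 250.0), ("c", 20.0), ("d", 120.0)],
)

# The UDF is applied inside an ordinary aggregate query.
rows = conn.execute(
    "SELECT usage_label(megabytes) AS label, COUNT(*) "
    "FROM mobile_usage GROUP BY label ORDER BY label"
).fetchall()
print(rows)
```

The custom logic lives in one function, while grouping, counting, and ordering stay declarative.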
Vertica: Good old SQL on a high-performing new engine
- Who invented it? Michael Stonebraker (PostgreSQL, Ingres, Vertica, StreamBase, VoltDB)
- Big Data customers: Facebook, Twitter, Zynga
- Global Maksimum CSA runs on Vertica
  - 4.5 TB of data processed daily to label customer mobile use
  - 80% of the SQL statements running on the engine are for data mining
Hadoop vs. Vertica (or RDBMS in General): Facebook
- We use Vertica for the things we make money on and Hadoop for other things. Tim Campos (Facebook CIO)
- Facebook loads 35 TB into Vertica every hour
Data Science BIG DATA
Machine Learning Toolkit: Ad hoc to Systematic
Advantages
- Supported by many academic groups
- Recently by many companies as a part of their solutions
- Simple learning curve
- Standard and well-documented
- Used and supported by all valley companies
Disadvantages
- Steep learning curve due to nonstandardization
- Limited scalability (still)
- Same problems as the others for large-scale implementations

- Starting to be used by valley companies
- Scalable, given that your algorithms allow that
- Human resource
- All known problems with Hadoop
- Ready to use
- Easy to integrate with existing applications
- Limited to recommendation algorithms
- No fine-grained control over algorithms
- Limited by your skillset
- Used by all valley companies
- Human resource
HP Distributed-R: Scalable R
- Distributed implementations of standard R data structures
- Same interfaces
- Requires distributed programming skills
Deep Learning: Neural Networks are back
- Neural networks have always been known for their weak mathematical foundation.
- For years, SVMs (and similar algorithms) beat them in accuracy.
- But things have changed recently:
  - Structure learning
  - Image processing
  - Natural Language Processing