The? Data: Introduction and Future




The? Data: Introduction and Future
Husnu Sensoy
Global Maksimum Data & Information Technologies

Global Maksimum Data & Information Technologies: The Data Company
Massive Data, Unstructured Data, Insight, Information, Action, Dark Data

Concepts Guide BIG DATA

Big Data: Problem with Terminology
Our industry is known to be one of the worst at naming things:
NoSQL → Not only SQL
Grid Computing → Cluster Computing → Distributed Computing
Massive Data is already named. Others are on the way.
Volume, Variety, Velocity, Veracity, Value: how many of them do you really need?

Big Data: DWH/BI vs. Big Data

                     Big Data            Conventional Analytics
Data Type            Unstructured        Column-Row Major Format
Volume               100 TB to 1 PB      Less than 100 TB
Data Delivery        Fast & Nonstop      Static, batch, or mini-batch
The way to analyze   Machine Learning    Hypothesis Based
Primary Purpose      Data Products       Decision Support & Related Services

Global Maksimum Comment: Data Type: not fully agreed. Remaining rows, as listed on the slide: not fully agreed, almost agreed, almost agreed.

Big Data: How to think about it
Big Data refers to things one can do at large scale that cannot be done at a smaller one, to extract new insights or create new forms of value,

Big Data: We have invented nothing
Lascaux Caves, 17000 years

Big Data: Not about the amount of data you have
It does not matter how high-quality your horse pictures are, or how many of them you can draw in 17000 years. What matters is drawing 25 horse pictures in a second and getting a movie of a running horse out of them.
Peter Norvig (Director of Research at Google Inc.)

Big Data: Curse of Sampling
What is the probability that the sun will rise tomorrow?
Problem with micro-segmentation?
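The slide only poses the sunrise question. One classical answer is Laplace's rule of succession, which estimates the chance of another success after s successes in n trials as (s + 1) / (n + 2), assuming a uniform prior on the unknown probability. The sketch below assumes that this is the question the slide alludes to; the function name is hypothetical and the day counts are illustrative only.

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's estimate of P(next trial succeeds), assuming a uniform prior."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)


# The sun rose on every one of n observed days (n values are illustrative).
for n in (0, 3, 10_000):
    estimate = rule_of_succession(n, n)
    print(f"after {n} sunrises: {estimate} = {float(estimate):.4f}")
```

Read against micro-segmentation: a segment of three customers who all behaved the same way has a raw frequency of 3/3, while the smoothed estimate is only 4/5. The smaller the segment, the more the estimate depends on the prior rather than on the data.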

Skills BIG DATA

Data Scientist: (Not) A New Job Title
A statistician having the best programming skills among other statisticians.
A programmer (hacker) having the best statistics understanding among other hackers.

Data Scientist: Open Education Initiative
What was previously non-trivial has now become really easy.

Tools BIG DATA

Hadoop: Components/Projects in Essence
Hadoop Distributed File System (HDFS): More stable, better performing and supported alternatives are available (Red Hat GlusterFS).
Hadoop MapReduce: A programming model that is not for productivity. Renamed version of the old scatter-gather paradigm. Started to be replaced by Spark.
Pig, Hive, Spark, Mahout
Cassandra: Distributed wide-columnar storage, well supported by Datastax.

MapReduce: Yet another over-marketing
Too low-level
Programming models/languages are for humans, not machines. From many perspectives C++ is a better language than Java. But
Not possible to use really without higher level abstractions like Pig, Hive or
Only simple example available in all slides is Word Frequency Counting
Lack of Novelty
All well-known MPP databases have been using it for decades
Limited interactivity
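For reference, the word frequency example mentioned above can be sketched in plain Python. This is a single-process illustration of the map, shuffle and reduce steps, not Hadoop code; all function names are hypothetical, and a real framework would run the map and reduce calls on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group the emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)  # scatter
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())  # gather


counts = word_count(["big data big hype", "Big SQL"])
print(counts)
```

Even in this toy form the programmer writes three functions and a driver for what SQL expresses as a single GROUP BY, which is the "too low-level" complaint in concrete terms.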

Hadoop: How do large-scale users use it?
Unstructured to structured conversion
ETL
Data Acquisition and Pipelining
Long term data retention

Relational Databases: We still do need them
New trends overload Hadoop
SQL (mathematically sound set-based processing) is still the best way for many data related operations.
SQL is also extensible and easy to scale by UDFs.
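The UDF point can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in engine (an assumption made for portability; it is not Vertica's UDF mechanism), and the table, the column names and the labelling thresholds are all hypothetical.

```python
import sqlite3


def usage_label(megabytes):
    """Hypothetical labelling rule; the 1000 MB threshold is made up."""
    if megabytes is None:
        return "unknown"
    return "heavy" if megabytes >= 1000 else "light"


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it by name.
conn.create_function("usage_label", 1, usage_label)

conn.execute("CREATE TABLE mobile_use (customer TEXT, megabytes REAL)")
conn.executemany(
    "INSERT INTO mobile_use VALUES (?, ?)",
    [("a", 2500.0), ("b", 120.0), ("c", None)],
)

# The custom logic now takes part in ordinary set-based processing.
rows = conn.execute(
    "SELECT usage_label(megabytes) AS label, COUNT(*) "
    "FROM mobile_use GROUP BY label ORDER BY label"
).fetchall()
print(rows)
```

The custom rule lives in one small function, while grouping, counting and ordering stay declarative.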

Vertica: Good old SQL on a high-performing new engine
Who invented it? Michael Stonebraker (PostgreSQL, Ingres, Vertica, Streambase, VoltDB)
Big Data customers: Facebook, Twitter, Zynga
Global Maksimum CSA runs on Vertica:
4.5 TB of data processed daily to label customer mobile use
80% of the SQL statements running on the engine are for data mining

Hadoop vs. Vertica (or RDBMS in General)
Facebook: "We use Vertica for things we make money and Hadoop for other things" (Tim Campos, Facebook CIO)
Facebook loads 35 TB into Vertica every hour

Data Science BIG DATA

Machine Learning Toolkit: Ad hoc to Systematic
Advantages
Supported by many academic groups
Recently by many companies as a part of their solutions
Simple learning curve
Standard and well-documented
Used and supported by all valley companies
Disadvantages
Steep learning curve due to non-standardization
Limited scalability (still)
Same problems with others for large size implementations
Start to be used by valley companies.
Scalable given that your algorithms allow that
Human resource
All known problems with Hadoop
Ready to use
Easy to integrate with existing applications
Limited with recommendation algorithms
No fine-grained control on algorithms
Limited by your skillset.
Used by all valley companies
Human resource

HP Distributed-R: Scalable R
Distributed implementations of standard R data structures.
Same interfaces
Requires distributed programming skills

Deep Learning: Neural Networks are back
Neural networks have always been known to have a poor mathematical basis.
For years SVMs (and similar algorithms) beat them in accuracy.
But things have changed recently:
Structure learning
Image processing
Natural Language Processing

GLOBAL MAKSiMUM