Big Data Technology. Choochart Haruechaiyasak, Ph.D.

Big Data Technology. Choochart Haruechaiyasak, Ph.D., Speech and Audio Technology Laboratory (SPT), National Electronics and Computer Technology Center (NECTEC), National Science and Technology Development Agency (NSTDA)

Presentation outline. Big Data Technology: overview, definition, motivation and properties; BI vs. data science; applications. Big Data Tools: Hadoop; NoSQL; MongoDB.

What is Big Data?

Big Data: Motivation Source: http://www.esg-global.com/blogs/big-data-a-better-definition/

Structured vs. Unstructured Data Source: http://www.datasciencecentral.com/profiles/blogs/structured-vs-unstructured-data-the-rise-of-data-anarchy

Big Data vs. Traditional Data Source: http://hortonworks.com/blog/7-key-drivers-for-the-big-data-market/

7 Key Drivers for the Big Data Market Source: http://hortonworks.com/blog/7-key-drivers-for-the-big-data-market/

3Vs of Big Data

4Vs of Big Data

Where does Big Data come from? Source: http://www.ibmbigdatahub.com/infographic/where-does-big-data-come

Gartner's Hype Cycle for Big Data chart

Big Data Business Model Maturity Chart Source: https://infocus.emc.com/william_schmarzo/big-data-business-model-maturity-chart/

Big Data Business Model Maturity Chart Source: Bill Schmarzo, Big Data: Understanding How Data Powers Big Business

Difference between BI and Data Science Source: Bill Schmarzo, CTO, EMC Consulting, Enterprise Information Management & Analytics, http://reflectionsblog.emc.com/2014/08/business-intelligence-analyst-data-scientist-whats-difference/

Data Scientist Source: http://insidebigdata.com/2013/10/13/evaluatingdata-scientist-job-description/ Source: http://www.americanis.net/2014/infographichot-data-science-2015/

Source: http://www.ibtimes.com/amazon-anticipatory-shipping-new-patent-shows-plans-shipproducts-customers-purchase-them-1545950

What is Hadoop? Source: The content from this section is summarized from http://practicalanalytics.wordpress.com/2011/11/06/explaining-hadoop-to-management-whats-the-big-data-deal/

Data is growing fast. Three reasons why we are generating data faster than ever: (1) processes are increasingly automated; (2) systems are increasingly interconnected; (3) people are increasingly living online.

Data and system evolution

Real-time data analytics. The continuing challenge in Web 2.0 is how to improve site relevance and performance, understand user behavior, and gain predictive insight to influence decisions. Consumer-oriented industries (travel, retail, financial services, digital media, search, etc.) all face similar real-time information dynamics.

What is Hadoop? Hadoop is a scalable, fault-tolerant distributed system for data storage and processing, released under the Apache license. Core Hadoop has two main systems: (1) Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage; (2) MapReduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction.
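The MapReduce programming abstraction can be illustrated with a minimal single-process sketch. This is not Hadoop code: the function names (map_phase, shuffle, reduce_phase) are hypothetical, and the scheduling, distribution, and fault tolerance that Hadoop provides are left out. It only shows the map, group-by-key, and reduce steps on a word-count task.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a dict of documents."""
    pairs = [pair
             for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(word, counts)
                for word, counts in shuffle(pairs).items())


# Hypothetical sample input: two tiny documents.
docs = {"d1": "big data big cluster", "d2": "data cluster"}
counts = word_count(docs)
print(counts)
```

In a real cluster the map calls would run on the nodes that hold the input blocks, and the shuffle would move data between machines; here everything runs in one loop.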

Hadoop: MapReduce

Hadoop's benefits. Flexibility: store any data (structured or not) and run any analysis. Scalability: start at 1 TB on 3 nodes and grow to petabytes on thousands of nodes. Economics: cost per TB at a fraction of traditional options.

Traditional DBMS. Traditional relational databases and data warehouse products excel at OLAP and OLTP workloads over structured data. These form the underpinnings of most IT applications. Use relational databases when you need (1) interactive OLAP analytics; (2) multistep ACID transactions; (3) 100% SQL compliance. It is becoming increasingly difficult for classic techniques to support the wide range of use cases and workloads that power the next wave of digital business.

Hadoop's approach. Hadoop is designed to solve a different problem: the fast, reliable analysis of structured, unstructured, and complex data. Hadoop and related software are designed for the 3Vs: (1) Volume: commodity hardware and open source software lower cost and increase capacity; (2) Velocity: data ingest speed is aided by append-only and schema-on-read design; (3) Variety: multiple tools to structure, process, and access data.
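The append-only, schema-on-read idea can be sketched as follows. This is an illustration under stated assumptions, not Hadoop's implementation: raw_store is a hypothetical in-memory list standing in for files in a cluster, and the record fields (user, amount, clicks) are invented for the example. Writes store raw lines untouched; a schema is applied only when the data is read.

```python
import json

# Hypothetical append-only store: raw lines are kept exactly as received.
raw_store = []


def ingest(line):
    """Write path: append the raw line without parsing or validating it."""
    raw_store.append(line)


def read_with_schema(schema):
    """Read path: parse each raw line and apply the schema chosen by the reader.

    `schema` maps a field name to a converter such as str or float.
    Fields missing from a record are skipped rather than rejected.
    """
    for line in raw_store:
        record = json.loads(line)
        yield {field: cast(record[field])
               for field, cast in schema.items()
               if field in record}


# Two records with different shapes are ingested at full speed.
ingest('{"user": "a", "amount": "12.5", "note": "x"}')
ingest('{"user": "b", "clicks": "3"}')

# The reader decides later which fields matter and how to type them.
rows = list(read_with_schema({"user": str, "amount": float}))
print(rows)
```

Because nothing is validated on the write path, ingest stays cheap; the cost of interpreting the data is paid by each analysis that reads it.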

Scenarios for Using Hadoop. When a user types a query, it isn't practical to exhaustively scan millions of items. Instead, it makes sense to create an index and use it to rank items and find the best matches. Hadoop provides a distributed indexing capability. Hadoop runs on a cluster of commodity, shared-nothing x86 servers. You can add or remove servers in a Hadoop cluster (sizes from 50 or 100 to 2000+ nodes) at will; the system detects and compensates for hardware or system problems on any server. Hadoop is self-healing and fault tolerant: it can deliver data and run large-scale, high-performance batch processing jobs in spite of system changes or failures.

Scenarios for Using Hadoop (cont'd). 1) Hadoop as an ETL and filtering platform. One of the biggest challenges with high-volume data sources is extracting valuable signal from a lot of noise. Hadoop platforms can read in the raw data, apply appropriate filters and logic, and output a structured summary or refined data set. This output (e.g., hourly index refreshes) can be analyzed further or serve as input to a more traditional analytic environment such as SAS. Typically only a small percentage of a raw data feed is required for any business problem.
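The filter-and-summarize pattern can be sketched on a small scale. The log format used here (timestamp, level, message separated by spaces) and the function name hourly_error_summary are assumptions made for illustration; a real Hadoop job would apply the same logic in parallel across a cluster.

```python
from collections import Counter


def hourly_error_summary(lines):
    """Keep only ERROR records and count them per hour.

    Assumed line format: "<ISO timestamp> <LEVEL> <message>".
    Malformed lines are treated as noise and dropped.
    """
    summary = Counter()
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # malformed record: not enough fields
        timestamp, level, _message = parts
        if level != "ERROR":
            continue  # filter: only errors are relevant to this question
        hour = timestamp[:13]  # e.g. "2015-01-01T10"
        summary[hour] += 1
    return dict(summary)


# Hypothetical raw feed: mostly noise, a few relevant records.
raw = [
    "2015-01-01T10:15:00 INFO user login",
    "2015-01-01T10:20:00 ERROR disk full",
    "2015-01-01T10:45:00 ERROR timeout",
    "2015-01-01T11:05:00 ERROR timeout",
    "garbage",
]
summary = hourly_error_summary(raw)
print(summary)
```

The structured summary is far smaller than the raw feed, which is what makes it practical to hand off to a downstream analytic tool.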

Scenarios for Using Hadoop (cont'd). 2) Hadoop as an exploration engine. Once the data is in the MapReduce cluster, it makes sense to use tools that analyze the data where it sits. Because the refined output is in a Hadoop cluster, new data can be added to the existing pile without reindexing everything; in other words, new data can be added to existing data summaries. Once the data is distilled, it can be loaded into corporate systems so that users have wider access to it.
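Adding new data to an existing summary, rather than recomputing from scratch, can be sketched as a merge of two count tables. The function name merge_summary and the sample keys are hypothetical; the sketch assumes the summary is a set of additive counts, which is the simplest case where incremental updates are exact.

```python
from collections import Counter


def merge_summary(existing, new_batch):
    """Fold the counts from a new batch into an existing summary.

    Only the new batch is processed; the data behind `existing`
    is never rescanned.
    """
    merged = Counter(existing)
    merged.update(new_batch)  # adds counts key by key
    return dict(merged)


# Hypothetical summaries: term counts from yesterday and from a new batch.
existing_summary = {"hadoop": 2, "mongodb": 1}
new_batch_summary = {"mongodb": 3, "nosql": 1}
updated = merge_summary(existing_summary, new_batch_summary)
print(updated)
```

Summaries that are not additive (for example medians) would need a different merge strategy.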

Scenarios for Using Hadoop (cont'd). 3) Hadoop as an archive. Historical data is usually archived to tape or disk on secondary storage or sent offsite. When this data is needed for analysis, it's painful and costly to retrieve it and load it back up. With cheap storage in a distributed cluster, lots of data can be kept active for continuous analysis. Hadoop is efficient: it allows better utilization of hardware by allowing the generation of different index types in one cluster.

The Hadoop Stack: Cloudera's Distribution for Hadoop (CDH)

Hadoop's Case Study:

Inverted index. Source: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/searchkitconcepts/searchkit_basics/searchkit_basics.html
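An inverted index maps each term to the list of documents that contain it, so a query can be answered by looking up terms instead of scanning every document. The sketch below is a minimal in-memory version; the function names and sample documents are hypothetical, tokenization is a plain whitespace split, and no ranking is applied.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to a sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Answer an AND query by intersecting the posting lists of its terms."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical document collection keyed by document ID.
docs = {
    1: "new home sales",
    2: "home sales rise",
    3: "rise in new prices",
}
index = build_inverted_index(docs)
print(index["home"])
print(search(index, "new rise"))
```

Only the posting lists for the query terms are touched, which is why lookup cost depends on the query rather than on the size of the collection.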

Source: This slide is from Stanford course: CS 276 / LING 286: Information Retrieval and Web Search, http://web.stanford.edu/class/cs276/

Hadoop: MapReduce

MapReduce: Google Distributed Indexing

Source: This slide is from Stanford course: CS 276 / LING 286: Information Retrieval and Web Search, http://web.stanford.edu/class/cs276/
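Index construction fits the MapReduce model naturally: the map step parses documents into (term, document ID) pairs, the pairs are partitioned by term, and each reduce step turns the pairs for one term into a posting list. The sketch below simulates this in one process. The function names are hypothetical, and the partitioning rule (first character modulo the number of reducers) is an assumption chosen for simplicity, not the scheme used by Google or Hadoop.

```python
from collections import defaultdict


def map_parse(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def partition(term, num_reducers):
    """Route a term to a reducer. Assumed rule: first character modulo N."""
    return ord(term[0]) % num_reducers


def reduce_invert(term, doc_ids):
    """Reduce: turn all document IDs for one term into a sorted posting list."""
    return term, sorted(doc_ids)


def distributed_index(documents, num_reducers=2):
    """Simulate the map, partition, and reduce steps in a single process."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for doc_id, text in documents.items():
        for term, emitted_id in map_parse(doc_id, text):
            buckets[partition(term, num_reducers)][term].append(emitted_id)
    index = {}
    for bucket in buckets:  # each bucket stands in for one reducer's input
        for term, doc_ids in bucket.items():
            key, postings = reduce_invert(term, doc_ids)
            index[key] = postings
    return index


# Hypothetical two-document collection.
docs = {1: "caesar came", 2: "caesar died"}
index = distributed_index(docs)
print(index)
```

Because every pair for a given term lands in the same bucket, each reducer can build complete posting lists without coordinating with the others.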

Handling Big Data with MongoDB

What is NoSQL?

RDBMS vs. NoSQL. Database transaction properties (ACID): Atomic: everything in a transaction succeeds or the entire transaction is rolled back. Consistent: a transaction cannot leave the database in an inconsistent state. Isolated: transactions cannot interfere with each other. Durable: completed transactions persist, even when servers restart. BASE: Basically Available, Soft-state services with Eventual consistency.
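Atomicity can be demonstrated with Python's built-in sqlite3 module. The accounts table, the names, and the transfer function are hypothetical. The sketch deliberately credits the destination before debiting the source, so that a failed debit shows the earlier credit being rolled back as well.

```python
import sqlite3

# In-memory database with a constraint that forbids negative balances.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # the credit to dst has been undone as well


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer leaves both balances exactly as they were after the first one, which is the "all or nothing" behavior the Atomic property describes.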

Why is NoSQL becoming popular?

RDBMS vs. NoSQL: an example

Source: http://en.wikipedia.org/wiki/cap_theorem

http://www.mongodb.org/

Google Trends

MongoDB: Key idea

MongoDB: Auto-sharding
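The basic idea behind sharding, routing each document to one of several servers based on a key, can be sketched as follows. This is a simplified illustration and does not reproduce MongoDB's actual chunking and balancing; the ShardedCollection class, the shard_for function, and the use of an MD5 hash of the _id field are all assumptions made for the example.

```python
import hashlib


def shard_for(key, num_shards):
    """Pick a shard by hashing the key (assumed routing rule)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedCollection:
    """Hypothetical collection whose documents are spread over N shards."""

    def __init__(self, num_shards):
        # Each dict stands in for the storage of one shard server.
        self.shards = [dict() for _ in range(num_shards)]

    def insert(self, doc):
        """Store the document on the shard chosen by its _id."""
        shard = self.shards[shard_for(doc["_id"], len(self.shards))]
        shard[doc["_id"]] = doc

    def find_one(self, _id):
        """Look up a document by _id, touching only the shard that owns it."""
        return self.shards[shard_for(_id, len(self.shards))].get(_id)


collection = ShardedCollection(num_shards=3)
for i in range(100):
    collection.insert({"_id": i, "value": i * i})

found = collection.find_one(42)
print(found)
print([len(shard) for shard in collection.shards])
```

A lookup by key goes to exactly one shard, so adding shards spreads both storage and query load. Changing the number of shards in this naive scheme would remap most keys, which is one reason real systems manage key ranges and rebalance them automatically.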

RDBMS vs. MongoDB
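The difference in data modeling can be sketched with plain Python structures. In a relational design, a user and the user's orders live in separate tables linked by a foreign key; in a document design, the orders are embedded inside the user document. The tables, field names, and the to_document function below are hypothetical, and no database is involved.

```python
import json

# Relational style: two "tables" as lists of rows, joined via user_id.
users = [{"id": 1, "name": "Ann"}]
orders = [
    {"id": 10, "user_id": 1, "item": "book"},
    {"id": 11, "user_id": 1, "item": "pen"},
]


def to_document(user, all_orders):
    """Build one self-contained document by embedding the user's orders."""
    return {
        "_id": user["id"],
        "name": user["name"],
        "orders": [{"item": order["item"]}
                   for order in all_orders
                   if order["user_id"] == user["id"]],
    }


# Document style: everything about the user is read in a single lookup.
doc = to_document(users[0], orders)
print(json.dumps(doc, indent=2))
```

The relational form avoids duplication and supports joins across any tables; the document form trades that for reading a whole entity without a join.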

Example: (1) Create a Java project. (2) Get the MongoDB Java driver.

Thank You Q&A