Big Data Technology
Choochart Haruechaiyasak, Ph.D.
Speech and Audio Technology Laboratory (SPT)
National Electronics and Computer Technology Center (NECTEC)
National Science and Technology Development Agency (NSTDA)
Presentation outline
Big Data Technology
  Overview, definition, motivation and properties
  BI vs. Data science
  Applications
Big Data Tools
  Hadoop
  NoSQL
  MongoDB
What is Big Data?
Big Data: Motivation Source: http://www.esg-global.com/blogs/big-data-a-better-definition/
Structured vs. Unstructured Data Source: http://www.datasciencecentral.com/profiles/blogs/structured-vs-unstructured-data-the-rise-of-data-anarchy
Big Data vs. Traditional Data Source: http://hortonworks.com/blog/7-key-drivers-for-the-big-data-market/
7 Key Drivers for the Big Data Market Source: http://hortonworks.com/blog/7-key-drivers-for-the-big-data-market/
3Vs of Big Data
4Vs of Big Data
Where does Big Data come from? Source: http://www.ibmbigdatahub.com/infographic/where-does-big-data-come
Gartner's Hype Cycle for Big Data chart
Big Data Business Model Maturity Chart Source: https://infocus.emc.com/william_schmarzo/big-data-business-model-maturity-chart/
Big Data Business Model Maturity Chart Source: Bill Schmarzo, Big Data: Understanding How Data Powers Big Business
Difference between BI and Data Science Source: Bill Schmarzo, CTO, EMC Consulting, Enterprise Information Management & Analytics, http://reflectionsblog.emc.com/2014/08/business-intelligence-analyst-data-scientist-whats-difference/
Data Scientist Source: http://insidebigdata.com/2013/10/13/evaluatingdata-scientist-job-description/ Source: http://www.americanis.net/2014/infographichot-data-science-2015/
Source: http://www.ibtimes.com/amazon-anticipatory-shipping-new-patent-shows-plans-shipproducts-customers-purchase-them-1545950
What is Hadoop? Source: The content from this section is summarized from http://practicalanalytics.wordpress.com/2011/11/06/explaining-hadoop-to-management-whats-the-big-data-deal/
Data is growing fast
Three reasons why we are generating data faster than ever:
(1) Processes are increasingly automated;
(2) Systems are increasingly interconnected;
(3) People are increasingly living online.
Data and system evolution
Real-time data analytics
The continuing challenge in Web 2.0 is how to improve site relevance and performance, understand user behavior, and gain predictive insight to influence decisions. Consumer-oriented industries (travel, retail, financial services, digital media, search, etc.) all face similar real-time information dynamics.
What is Hadoop?
Hadoop is a scalable, fault-tolerant distributed system for data storage and processing (Apache license). Core Hadoop has two main systems:
(1) Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage.
(2) MapReduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction.
Hadoop: MapReduce
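The MapReduce programming abstraction can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase) and the two sample documents are assumptions made for illustration. It only shows the three steps of the model: map emits key/value pairs, shuffle groups them by key, and reduce combines each group.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (word, 1) pair per word in a document."""
    for word in text.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)

def word_count(documents):
    # In a real cluster each map call could run on a different node;
    # here everything runs in one process for illustration only.
    pairs = [pair for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical sample input
docs = {"d1": "big data big tools", "d2": "data tools"}
counts = word_count(docs)
```

Because map calls are independent of each other and reduce calls are independent per key, both phases can in principle be spread across many machines.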
Hadoop's benefits
Flexibility: store any data (structured or not), run any analysis.
Scalability: start at 1 TB on 3 nodes, grow to petabytes on thousands of nodes.
Economics: cost per TB at a fraction of traditional options.
Traditional DBMS
Traditional relational databases and data warehouse products excel at OLAP and OLTP workloads over structured data. These form the underpinnings of most IT applications.
Use relational databases when dealing with:
(1) Interactive OLAP analytics;
(2) Multistep ACID transactions;
(3) 100% SQL compliance.
It is becoming increasingly difficult for classic techniques to support the wide range of use cases and workloads that power the next wave of digital business.
Hadoop's approach
Hadoop is designed to solve a different problem: the fast, reliable analysis of structured, unstructured, and complex data. Hadoop and related software are designed for the 3Vs:
(1) Volume: commodity hardware and open source software lower cost and increase capacity;
(2) Velocity: data ingest speed is aided by an append-only, schema-on-read design;
(3) Variety: multiple tools to structure, process, and access data.
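The append-only, schema-on-read idea can be sketched in a few lines. This is an illustration, not Hadoop's API: the helper names (append_raw, read_with_schema), the JSON record layout, and the in-memory store are all assumptions. Ingest writes raw lines without validating them, and structure is imposed only when the data is read.

```python
import io
import json

def append_raw(store, record_line):
    """Ingest: append the raw line unchanged; no validation, no schema."""
    store.seek(0, io.SEEK_END)
    store.write(record_line + "\n")

def read_with_schema(store, schema):
    """Read: apply a schema (field name -> converter) while scanning."""
    store.seek(0)
    for line in store:
        raw = json.loads(line)
        # Fields missing from a record become None; extra fields are ignored.
        yield {field: (cast(raw[field]) if field in raw else None)
               for field, cast in schema.items()}

store = io.StringIO()  # stands in for an append-only file
append_raw(store, '{"user": "a", "amount": "12.5", "extra": true}')
append_raw(store, '{"user": "b"}')

# The schema is chosen by the reader, after the data was written.
schema = {"user": str, "amount": float}
rows = list(read_with_schema(store, schema))
```

Since nothing is checked at write time, ingest stays cheap, and different readers can apply different schemas to the same raw data.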
Scenarios for Using Hadoop
When a user types a query, it isn't practical to exhaustively scan millions of items. Instead it makes sense to create an index and use it to rank items and find the best matches. Hadoop provides a distributed indexing capability.
Hadoop runs on a cluster of commodity, shared-nothing x86 servers. You can add or remove servers in a Hadoop cluster (sizes from 50 or 100 to even 2000+ nodes) at will; the system detects and compensates for hardware or system problems on any server.
Hadoop is self-healing and fault tolerant. It can deliver data and run large-scale, high-performance batch processing jobs in spite of system changes or failures.
Scenarios for Using Hadoop (cont'd)
1) Hadoop as an ETL and Filtering Platform
One of the biggest challenges with high-volume data sources is extracting valuable signal from a lot of noise. Hadoop platforms can read in the raw data, apply appropriate filters and logic, and output a structured summary or refined data set. This output (e.g., hourly index refreshes) can be further analyzed or serve as an input to a more traditional analytic environment like SAS. Typically only a small percentage of a raw data feed is required for any business problem.
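The read, filter, and summarize pattern described above can be sketched as a small pipeline. The log format, the sample lines, and the function names (extract, keep_errors, hourly_summary) are hypothetical; on a real cluster each stage would run as distributed jobs rather than as Python generators.

```python
from collections import Counter

# Hypothetical raw feed: timestamp, method, path, HTTP status
RAW_LOG = [
    "2015-03-01T10:05:11 GET /index.html 200",
    "2015-03-01T10:17:42 GET /missing 404",
    "malformed line",
    "2015-03-01T11:02:03 GET /cart 500",
    "2015-03-01T11:40:00 GET /cart 500",
]

def extract(lines):
    """Parse raw lines; skip any that do not have the expected 4 fields."""
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[3].isdigit():
            timestamp, _method, path, status = parts
            yield timestamp, path, int(status)

def keep_errors(records):
    """Filter: keep only the small fraction of interest (HTTP errors)."""
    return (r for r in records if r[2] >= 400)

def hourly_summary(records):
    """Summarize: count records per hour (first 13 chars = YYYY-MM-DDTHH)."""
    return Counter(timestamp[:13] for timestamp, _path, _status in records)

summary = hourly_summary(keep_errors(extract(RAW_LOG)))
```

The resulting summary is small and structured, which is what makes it suitable as input to a traditional analytic environment.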
Scenarios for Using Hadoop (cont'd)
2) Hadoop as an exploration engine
Once the data is in the MapReduce cluster, using tools to analyze data where it sits makes sense. As the refined output is in a Hadoop cluster, new data can be added to the existing pile without having to reindex all over again. In other words, new data can be added to existing data summaries. Once the data is distilled, it can be loaded into corporate systems so users have wider access to it.
Scenarios for Using Hadoop (cont'd)
3) Hadoop as an Archive
Historical data is usually archived on tape or disk to secondary storage or sent offsite. When this data is needed for analysis, it's painful and costly to retrieve it and load it back up. With cheap storage in a distributed cluster, lots of data can be kept active for continuous analysis. Hadoop is efficient: it allows better utilization of hardware by allowing the generation of different index types in one cluster.
The Hadoop Stack
Cloudera's Distribution for Hadoop (CDH)
Hadoop's Case Study:
Inverted index Source: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/searchkitconcepts/searchkit_basics/searchkit_basics.html
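As a minimal sketch of the idea, an inverted index maps each term to a sorted list of the documents that contain it (its postings list), and a two-term AND query is answered by intersecting two postings lists. The sample documents and function names are assumptions for illustration; tokenization here is a plain whitespace split.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Merge two sorted postings lists, keeping doc IDs present in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def query_and(index, term1, term2):
    """Answer 'term1 AND term2' without scanning any document text."""
    return intersect(index.get(term1, []), index.get(term2, []))

# Hypothetical sample collection
docs = {
    1: "hadoop stores big data",
    2: "mongodb stores documents",
    3: "big data tools hadoop",
}
index = build_inverted_index(docs)
```

Keeping postings sorted lets the intersection run in time linear in the combined length of the two lists.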
Source: This slide is from Stanford course: CS 276 / LING 286: Information Retrieval and Web Search, http://web.stanford.edu/class/cs276/
Hadoop: MapReduce
MapReduce: Google Distributed Indexing
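Index construction fits the MapReduce model naturally, and a rough single-process sketch follows. It is an illustration under stated assumptions, not Google's or Hadoop's implementation: the term ranges in PARTITIONS, the sample splits, and the names parse, partition_of, and invert are all hypothetical. Parsers (map) emit (term, doc ID) pairs, the pairs are partitioned by term range, and inverters (reduce) turn each partition into postings lists.

```python
from collections import defaultdict

# Assumed term ranges, one per inverter
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]

def parse(split):
    """Parser (map): emit (term, doc_id) pairs for one split of documents."""
    for doc_id, text in split:
        for term in set(text.lower().split()):
            yield term, doc_id

def partition_of(term):
    """Pick the partition whose range covers the term's first letter."""
    for n, (low, high) in enumerate(PARTITIONS):
        if low <= term[0] <= high:
            return n
    return len(PARTITIONS) - 1  # anything outside a-z goes to the last one

def invert(pairs):
    """Inverter (reduce): collect pairs into sorted postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        postings[term].append(doc_id)
    return dict(postings)

def build_index(splits):
    buckets = [[] for _ in PARTITIONS]
    for split in splits:  # each split could be parsed on a different node
        for term, doc_id in parse(split):
            buckets[partition_of(term)].append((term, doc_id))
    return [invert(bucket) for bucket in buckets]

# Hypothetical document splits: lists of (doc_id, text)
splits = [
    [(1, "caesar came"), (2, "caesar conquered")],
    [(3, "rome was conquered")],
]
index_parts = build_index(splits)
```

Each element of index_parts is the portion of the index owned by one inverter, so the full index is the union of the parts.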
Source: This slide is from Stanford course: CS 276 / LING 286: Information Retrieval and Web Search, http://web.stanford.edu/class/cs276/
Handling Big Data with MongoDB
What is NoSQL?
RDBMS vs. NoSQL
ACID: database transaction properties
Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
Consistent: A transaction cannot leave the database in an inconsistent state.
Isolated: Transactions cannot interfere with each other.
Durable: Completed transactions persist, even when servers restart, etc.
BASE: Basically Available, Soft-state services with Eventual consistency.
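The atomic property can be demonstrated with a small sketch using Python's built-in sqlite3 module as a stand-in for a relational database. The account table, the names, and the amounts are hypothetical. A transfer performs two updates inside one transaction; when the second update violates a constraint, the first is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit above was undone too.
        return False

ok = transfer(conn, "alice", "bob", 30)       # within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

After the failed transfer, the balances are the same as before it started, which is the all-or-nothing behavior that BASE systems relax in exchange for availability.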
Why is NoSQL becoming popular?
RDBMS vs. NoSQL: an example
Source: http://en.wikipedia.org/wiki/cap_theorem
http://www.mongodb.org/
Google Trends
MongoDB: Key idea
MongoDB: Auto-sharding
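The routing idea behind sharding can be sketched with a toy range-based router. This is not MongoDB's implementation or API: the RangeShardRouter class, the split points, and the "user" shard key are assumptions for illustration, and the split points are fixed here, whereas an auto-sharding system adjusts ranges and moves data between shards on its own.

```python
import bisect

class RangeShardRouter:
    """Toy router: each shard owns a contiguous range of shard-key values."""

    def __init__(self, split_points, shard_names):
        # With split points ["h", "q"]: keys < "h" go to the first shard,
        # "h" <= key < "q" to the second, and keys >= "q" to the third.
        assert len(shard_names) == len(split_points) + 1
        self.split_points = split_points
        self.names = shard_names
        self.shards = {name: [] for name in shard_names}

    def shard_for(self, key):
        """Find the owning shard with a binary search over split points."""
        return self.names[bisect.bisect_right(self.split_points, key)]

    def insert(self, document, shard_key="user"):
        self.shards[self.shard_for(document[shard_key])].append(document)

    def find(self, key, shard_key="user"):
        """Only the owning shard is consulted, not the whole cluster."""
        owner = self.shards[self.shard_for(key)]
        return [d for d in owner if d[shard_key] == key]

router = RangeShardRouter(["h", "q"], ["shard0", "shard1", "shard2"])
for user in ["alice", "harry", "mona", "quinn", "zoe"]:
    router.insert({"user": user})
```

A query on the shard key touches a single shard, which is why the choice of shard key matters for both load balance and query cost.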
RDBMS vs. MongoDB
Example
1. Create a Java Project
2. Get Mongo Java Driver
Example
Thank You Q&A