The Internet of Things and Big Data: Intro


1 The Internet of Things and Big Data: Intro
John Berns, Solutions Architect, APAC - MapR Technologies
April 22nd

2 What This Is; What This Is Not
- It's not specific to IoT
- It's not about any specific type of data or protocol
- It's not specific to any particular industry
- It's about processing big data
- IoT data can be big data
- IoT might be the biggest data of the coming decade
- But it's just big data
- Same strategies & technologies apply

5 When Does Data Become Big?
- When the size of the data, itself, becomes a problem
- When the old way of processing data just doesn't work effectively
It's big when we have to rethink:
- How we store that much data
- How we move that much data
- How we extract, load & transform that much data
- How we explore and analyze that much data
- How we process and get meaningful insights from that much data

6 C'mon! What does that mean in size?
- Not gigabytes
- Most likely not a few terabytes
- Possibly not 10s of terabytes
- Probably 100s of terabytes
- Definitely petabytes

7 So How Do We Handle Big Data?
Distribute & parallelize!
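
To make "distribute & parallelize" concrete, here is a minimal Python sketch that is not taken from the presentation. It splits a data set into partitions, computes a partial result on each partition independently, and then combines the partials. Threads stand in for cluster nodes, and the names `partition`, `partial_sum`, and `distributed_mean` are illustrative assumptions rather than any framework's API.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split records into n_parts chunks by taking every n-th element."""
    return [records[i::n_parts] for i in range(n_parts)]


def partial_sum(chunk):
    """Work done independently on one partition: return (sum, count)."""
    return sum(chunk), len(chunk)


def distributed_mean(records, n_parts=4):
    """Compute the mean by combining per-partition partial results."""
    chunks = partition(records, n_parts)
    # Each worker thread plays the role of one machine in a cluster.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_sum, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Hypothetical temperature readings, used only for illustration.
readings = [20.5, 21.0, 19.5, 22.0, 20.0, 21.5, 18.5, 23.0]
mean_temp = distributed_mean(readings)
```

The key design point is that `partial_sum` needs no knowledge of the other partitions, so adding machines (here, threads) adds capacity without changing the logic.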

8 MPP Analytic Databases or Hadoop

9 Big Data Analytics: Bridging classic & big data worlds
Classic Method (SQL performance and structure): Structured & repeatable analysis
- Business determines what questions to ask
- IT structures the data to answer those questions
- Capture only what's needed
Big Data Method (Hadoop scale and flexibility): Multi-structured & iterative analysis
- IT delivers a platform for storing, refining, and analyzing all data sources
- Business explores data for questions worth answering
- Capture in case it's needed

10 Philosophical Differences

| Traditional Methods | Big Data |
|---|---|
| More power | More machines |
| Summarize data | Keep all data |
| Transform and store | Transform on demand |
| Pre-defined schema | Flexible / no schema |
| Move data -> compute | Move compute -> data |
| Less data / more complex algorithms | More data / simple algorithms |

11 answer = f(all data)
- Save all raw data
- Data immutability
- Transform as needed
- Result is based on the raw data
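
The idea behind answer = f(all data) can be sketched in a few lines of Python. This is an illustration under assumptions, not code from the presentation: the raw events are kept in an immutable tuple, and every answer is a function recomputed from them. The event layout and the function names `max_per_device` and `mean_per_device` are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw events: (device_id, timestamp, temperature_c).
# Stored as an immutable tuple: events are appended elsewhere, never edited.
RAW_EVENTS = (
    ("sensor-a", 1, 20.0),
    ("sensor-b", 1, 25.0),
    ("sensor-a", 2, 22.0),
    ("sensor-b", 2, 27.0),
)


def max_per_device(events):
    """One possible answer: highest temperature seen per device."""
    result = {}
    for device, _, temp in events:
        result[device] = max(temp, result.get(device, temp))
    return result


def mean_per_device(events):
    """A different answer derived later from the same raw data."""
    totals = defaultdict(lambda: [0.0, 0])
    for device, _, temp in events:
        totals[device][0] += temp
        totals[device][1] += 1
    return {device: s / n for device, (s, n) in totals.items()}
```

Because nothing was summarized away at load time, a new question such as `mean_per_device` can be answered without re-collecting data.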

12 Q & A
Engage with us: maprtech, mapr-technologies, MapR

13 IoT and Big Data: Hadoop as a Data Platform
John Berns, Solutions Architect, APAC - MapR Technologies
April 22nd

14 Hadoop: The Disruptive Technology at the Core of Big Data

15 Forces of Adoption
Hadoop TAM comes from disrupting enterprise data warehouse and storage spending
- IT budgets growing at 2.5%
- Data growing at 40%
[Chart: $ per terabyte for enterprise storage, database warehouse, and Hadoop; values shown: $40,000, $9,000, <$1,]
Sources: Gartner, "Forecast Analysis: Enterprise IT Spending by Vertical Industry Market, Worldwide, , 3Q12 Update." Wall Street Journal, "Financial Services Companies Firms See Results from Big Data Push," Jan. 27,

16 Hadoop 101 (External Presentation)

17 Hadoop Hardware

18 Typical Compute Node
- Two CPUs, each with 4-8 cores
- GB Memory
- 6-24 hard disks
- GB Network cards

19 Hadoop Ecosystem

20 Ecosystem of Projects Built on Hadoop

21 SQL On Hadoop

22 SQL on Hadoop
- Generally data has no inherent schema
- Schema is defined by user / interpreted from structure
- Schema is applied during processing
- One file can have many schemas applied
- Works for many kinds of data but not all
- Temperature sensor data? Sure
- Video feeds? Not really
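
A minimal Python sketch of "schema is applied during processing" follows. It is an illustration only and does not use the API of any SQL-on-Hadoop engine: the same raw text is read twice, each time with a different user-defined schema. The sample data, the `(column_name, cast)` schema layout, and the function `read_with_schema` are assumptions made for this example.

```python
import csv
import io

# Hypothetical raw sensor file: the bytes carry no schema of their own.
RAW = "10:00,sensor-a,21.5\n10:01,sensor-b,19.0\n"


def read_with_schema(raw_text, schema):
    """Apply a schema at read time.

    schema is a list of (column_name, cast) pairs, one per field;
    a cast of None means the field is ignored by this schema.
    """
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, cast), value in zip(schema, fields):
            if cast is not None:
                row[name] = cast(value)
        rows.append(row)
    return rows


# Two schemas over the same file: one full, one that only wants temperatures.
full_schema = [("time", str), ("device", str), ("temp_c", float)]
temps_only = [("time", None), ("device", None), ("temp_c", float)]

full_rows = read_with_schema(RAW, full_schema)
temp_rows = read_with_schema(RAW, temps_only)
```

The raw text is never rewritten; only the interpretation changes, which is why one file can have many schemas applied.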

23 Key Use Cases
1 Big Data Analysis
- Large-scale SQL queries on long history
- Well defined schema
- Known value, but high cost in existing systems
2 Big Data Exploration
- Exploratory analysis on large scale raw data
- Unknown value
- No defined schema
- Variety of data types

24 What is Driving the Need for SQL-on-Hadoop?
Organizations are looking for:
- Reuse of existing tools and skills to unlock Hadoop data for a broader audience
- Analysis on new types of data
- More complete data analysis
- More up-to-date and real-time data analysis (not just after the fact)

25 SQL on Hadoop: Many Options
Flexibility to choose when to use which based on use case

| | Drill 1.0 | Hive 0.13 with Tez | Impala 1.x | Presto 0.56 | Shark 0.8 | Vertica |
|---|---|---|---|---|---|---|
| Latency | Low | Medium | Low | Low | Medium | Low |
| Files | Yes (all Hive file formats) | Yes (all Hive file formats) | Yes (Parquet, Sequence, ...) | Yes (RC, Sequence, Text) | Yes (all Hive file formats) | Yes (all Hive file formats) |
| HBase/M7 | Yes | Yes | Various issues | No | Yes | No |
| Schema | Hive or schemaless | Hive | Hive | Hive | Hive | Proprietary or Hive |
| SQL support | ANSI SQL | HiveQL | HiveQL (subset) | ANSI SQL | HiveQL | ANSI SQL + advanced analytics |
| Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC, ADO.NET, ... |
| Large joins | Yes | Yes | No | No | No | Yes |
| Nested data | Yes | Limited | No | Limited | Limited | Limited |
| Hive UDFs | Yes | Yes | Limited | No | Yes | No |
| Transactions | No | No | No | No | No | Yes |
| Optimizer | Limited | Limited | Limited | Limited | Limited | Yes |
| Concurrency | Limited | Limited | Limited | Limited | Limited | Yes |

26 Proven Hadoop Production Success
ENTERPRISE DATA HUB
- Multi-structured data staging & archive
- ETL / DW optimization
- Mainframe optimization
- Data exploration
MARKETING ANALYTICS
- Recommendation engines & targeting
- Ad optimization
- Pricing analysis
- Lead scoring
RISK ANALYTICS
- Network security monitoring
- Security information & event management
- Fraudulent behavioral analysis
OPERATIONS INTELLIGENCE
- Supply chain & logistics
- System log analysis
- Manufacturing quality assurance
- Preventative maintenance
- Sensor analysis

27 Other Tools & Frameworks of Note

28 Pig
- Procedural language
- Loops, if-then statements

29 Map Reduce Framework
- Lingual: SQL-like operations
- Pattern: Machine Learning Applications
- Scalding: Cascading for Scala
- Cascalog: Cascading for Clojure

30 Python, Scala and Java
Spark powers a stack of high-level tools including Shark for SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these frameworks seamlessly in the same application.

31 Machine Learning / Predictive Analytics
- Collaborative Filtering
- Linear / Logistic Regression
- Naïve Bayes
- Random Forests
- K-Means Clustering
- Canopy Clustering
- Principal Component Analysis

32 Database on Hadoop
- Highly scalable
- Columnar
- Flexible schema
- Data source for Map Reduce and Spark jobs
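
The "columnar, flexible schema" properties can be illustrated with a toy in-memory table in Python. This is a sketch under assumptions and is not the API of HBase or any real database: each row is a sparse map of column names to values, so rows need not share columns, and rows are scanned in key order. The class `WideColumnTable`, the `d:` column prefix, and the row-key layout are hypothetical.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy sparse table: row key -> {column name -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell; new columns need no schema change."""
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        """Read one cell, returning default if the row or column is absent."""
        return self._rows.get(row_key, {}).get(column, default)

    def scan(self, start_key, stop_key):
        """Yield (row_key, columns) for start_key <= row_key < stop_key, in key order."""
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key, dict(self._rows[key])


table = WideColumnTable()
table.put("sensor-a#0001", "d:temp", 21.5)
table.put("sensor-a#0002", "d:humidity", 40)   # a different column on this row
table.put("sensor-b#0001", "d:temp", 19.0)

sensor_a_rows = list(table.scan("sensor-a", "sensor-b"))
```

Prefixing the row key with the device id keeps each device's readings adjacent, so a range scan retrieves one device's history without touching the rest.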

33 Q & A

34 IoT and Big Data: Architectures & Use Cases
John Berns, Solutions Architect, APAC - MapR Technologies
April 22nd

35 NoSQL

36 NoSQL Databases
- No-SQL or Not only SQL
- Give up some of the functionality of traditional relational databases for speed and scalability
- Types: Key-Value, Columnar, Document, Graph
- NoSQL databases favor flexible schemas

37 HBase

38 Queues

39 Queues
- Just like a queue at an amusement park
- First in, first out (FIFO)
- Queues messages or events
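
The first-in, first-out behavior described above can be sketched with Python's standard `collections.deque`. This is a single-process illustration, not a distributed message broker, and the class `MessageQueue` with its `publish` and `consume` methods is a hypothetical name chosen for the example.

```python
from collections import deque


class MessageQueue:
    """Minimal in-memory FIFO queue of messages or events."""

    def __init__(self):
        self._items = deque()

    def publish(self, message):
        """Add a message at the tail of the queue."""
        self._items.append(message)

    def consume(self):
        """Remove and return the oldest message, or None if the queue is empty."""
        return self._items.popleft() if self._items else None

    def __len__(self):
        return len(self._items)


queue = MessageQueue()
for event in ("door-open", "temp=21.5", "door-closed"):
    queue.publish(event)

first = queue.consume()    # oldest message comes out first
second = queue.consume()
```

A `deque` is used instead of a list because removing from the head of a list costs time proportional to its length, while `popleft` is constant time.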

40 Message Queue

41 Stream Processing

42 Stream Processing
- Handles data at high velocity
- If Hadoop is the ocean, streams are the firehose
- Processing in near real-time
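
As a rough illustration of processing in near real-time, the Python sketch below consumes events one at a time and emits an average as soon as each fixed-size time window closes, without ever holding the whole stream in memory. It is a single-process sketch, not the API of Storm or any other stream processor; the function `tumbling_window_mean` and the sample events are assumptions, and events are assumed to arrive in timestamp order.

```python
def tumbling_window_mean(events, window_size):
    """Yield (window_start, mean) for consecutive fixed-size time windows.

    events: iterable of (timestamp, value) pairs in timestamp order.
    A result is emitted as soon as an event from a later window arrives.
    """
    current_start = None
    total = 0.0
    count = 0
    for timestamp, value in events:
        start = timestamp - (timestamp % window_size)
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The previous window is complete: emit it and start a new one.
            yield current_start, total / count
            current_start, total, count = start, 0.0, 0
        total += value
        count += 1
    if count:
        # Flush the final, possibly partial, window when the stream ends.
        yield current_start, total / count


# Hypothetical (timestamp_seconds, temperature) events.
stream = iter([(0, 10.0), (3, 20.0), (5, 30.0), (9, 50.0), (12, 70.0)])
windows = list(tumbling_window_mean(stream, window_size=5))
```

Because the function is a generator over an iterator, the same code works whether the events come from a list, a file, or a live feed.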

43 Storm

44 Batch Processing

45 Combination Architectures

46 Lambda Architecture

47 Complex Architectures Using Many Big Data Technologies

48 Wanna Play?

49 Q & A

50 MPP Analytic Databases or Hadoop
