Big Data for Architects


Big Data for Architects
Sam MidLink, sam@midlink.ca

Agenda
- What is BigData?
- BigData challenges
- Textbook big data architecture
- Analysis, visualization & self-service
- BigData in the cloud
- Where is this all going?

What is BigData?
Why is everyone talking about BigData?
- BigData is a Big Buzzword coined by Big Companies to make Big Money
- There are a lot of new and exciting technologies behind the buzzword
- A lot of money has been and is being invested in BigData companies

What is BigData?
- Commonly defined as a combination of three attributes (the three Vs): Volume, Velocity, Variety
- My definition: data that prevents running the algorithms you want on it in a reasonable amount of time

What is BigData?
- Big Data is not a buzzword; it is something more and more organizations have
- Today most consider > 15TB to be Big Data
- For real time, > 5TB is considered Big Data
- The internet entities are managing PBs of data
- Most organizations have outgrown their expectations

Data sources
- Machine generated: handheld devices, optimizers, analyzers, loggers, probes and a billion sensors

Use cases
- Advertisement optimization and verification
- Security and fraud detection
- Algo-trading
- Behavioral analysis
- Network optimization
- Real-time billing
- Catching terrorists

Some numbers
- In-memory databases of 50TB are reasonable
- Columnar databases of 2-5PB are reasonable
- We usually do not consider 15TB big data for analytics
- This is all on commodity hardware

Some numbers
- Airbus A380 => ~640TB per flight
- Twitter => ~12TB per day
- NYSE => ~1TB per day
- Storage capacity doubles every 3 years
- We have ~3ZB (zettabytes) of data in the world

Is there value in BigData?
- In 2014 Big Data vendors will pocket nearly $30 billion from HW, SW and PS (2020 => $76B) (marketwatch.com)
- How did Target know your 16-year-old is pregnant before you did?
- Would you like to know the weather tomorrow?
- Take a look at the Goldcorp Challenge
- Fight cancer, crime, starvation & pollution
- There is one CCTV camera for every 32 people in the UK

Leveraging BigData technologies
- BigData technologies can be used to solve other problems
- 50GB is not Big Data, but reducing its footprint is a big win!
- PostgreSQL has been "the new DB" since 1989
- CEP (Complex Event Processing) has replaced rules engines
- Open source is leading the way

Big Data Challenge
- Data and analytical complexity
- Extracting actual business value
- Data-driven decision making

BigData challenges
- Organizations capture far more data than they ever have
- Storage is usually not the issue
- No two organizations are the same, nor are their data sets
- What most want is to analyze the data and extract value

Data Challenges & Complexity
- Volume: collecting, storage & backup, scanning (querying)
- Velocity: data that has to be processed in real time, parallelism, correlation (over time and between series)

Data Challenges & Complexity
- Variety: unstructured data alongside structured data, internal alongside external data, a variety of formats (MP4, AVI, Avro etc.)
- Distribution: multiple data centers, regulation, communication disconnections

Data Challenges & Complexity
- Life-cycle management: retention policy; what is enough sampling, two or three Christmas seasons?
- Stale data (in different systems)
- Coherence/consistency between systems & data sources
- Integration: data collection and augmentation, multiple-system integration (islands of data)

Data Challenges & Complexity
- Security: all your data is in one location; who is allowed to access which data? Multi-tenancy
- One version of the truth: where do you live right now? How many of the leads did convert?

Data Challenges & Complexity
- Working with legacy systems (prolonged lifetimes)
- Self-service: investigative access for ALL users, non-techies included, keeping access permissions in mind
- Ever-changing data model
- High-touch vs. low-touch data

Textbook BigData architecture

BigData - the big picture (partial)
[Diagram: vendor and technology landscape covering NoSQL, sharding, columnar, visualization, Hadoop, cloud, OLTP, appliance and in-memory categories, with players including Amazon, Actian, Google, Yahoo, Datastax, EnterpriseDB, Oracle, SAP, Apache, Twitter, IBM, HP, EMC, Tableau, LinkedIn and Datameer]

BigData ecosystem
- ETL, code, enrichment, BI, storage, deep analytics, lifecycle, virtualization

Considerations
- Is data your business?
- How much data do you have? Where is it now?
- What is your growth projection?
- Can you take it outside the organization?
- Do you need to augment? Do you need the raw data?

Considerations
- For how long do you need the data?
- Can you save partial data/aggregates?
- Can you join/dedup/aggregate historical data?
- Do you need real-time processing?
- Do you have frequent schema changes?
- Do you need to do OLAP & OLTP on the same system?
- Any HW/OS/MEM limitations?

Climbing up the ladder
- Will fast storage be enough?
- Will open source cut it? Or, what part do you use open source for?
- Which route is best? Hadoop, NoSQL, in-memory, columnar etc.
- What do you use to code? Which ETL do you use? How do you visualize?

Textbook architecture (lambda)
[Diagram: lambda architecture]

A bit more depth

Introducing the major players: Kafka (kafka.apache.org)
- Apache Kafka is publish-subscribe messaging rethought as a distributed commit log
- Open sourced and maintained by LinkedIn
- Fast, scalable, durable, distributed by design, persistent

Kafka - how it works
- Publishers send messages to a cluster of brokers
- The brokers persist the messages to disk
- Consumers can request a range of messages (offset and length)
- Everything is distributed: pub, sub & queues
- Consumers maintain their own state (no TX)
- Throughput oriented

Kafka - log-based queue
- Messages are persisted to append-only logs; sequential writes and reads
- Topics are distributed queues (partitions)
- Partitions are replicated: leader & followers
- Producers load balance; instances know each other
- Ack level can be set
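
To make the produce/consume flow concrete, here is a minimal sketch using the kafka-python client; the broker address, topic name, consumer group and payload are illustrative assumptions, not taken from the deck:

    # Producer: send a message and wait for acknowledgement from all replicas
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="broker1:9092", acks="all")
    producer.send("clickstream", b'{"user": 42, "url": "/home"}')
    producer.flush()  # block until the broker has persisted the message

    # Consumer: pull messages; the consumer tracks its own position (offset)
    consumer = KafkaConsumer("clickstream",
                             bootstrap_servers="broker1:9092",
                             group_id="analytics",
                             auto_offset_reset="earliest")
    for msg in consumer:
        print(msg.partition, msg.offset, msg.value)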

Kafka
- A single broker can handle hundreds of megabytes of reads and writes per second from thousands of clients
- A cluster is a great data-transfer backbone
- Data is retained for a preset time (e.g. 2 days)
- Guarantees the order of messages per partition
- Uses java.nio.channels.FileChannel#transferTo (the sendfile system call)

Storm - CEP
- CEP: Complex Event Processing
- Platform for analyzing streams of data as they occur
- Highly distributed, real-time computation system
- Provides primitives for real-time computation
- Simplifies working with queues and workers
- Fault tolerant and scalable
- Complementary to Hadoop
- Created at BackType and acquired by Twitter in 2011; later entered the Apache Incubator

Storm components
- Master node runs the Nimbus daemon: distributes code around the cluster, assigns tasks and monitors failures
- Worker nodes run the Supervisor daemon: listen for assigned work, run the worker processes
- Zookeeper maintains the coordination service between the supervisors and the master

Storm components
- Stream: an unbounded sequence of tuples
- Spout: a stream source; reads data from real data sources (logs, API calls, event data...) and generates a stream
- Bolt: processes input streams (joins, filters, aggregations...) and produces output streams; contains data processing, persistence and messaging/alert logic
- Spouts and Bolts execute as Tasks across the cluster

Storm components
- Spouts and Bolts are packaged into a Topology
- A Topology is submitted to a Storm cluster for execution and runs indefinitely until it is manually terminated

Storm use cases
- Stream processing of tweets
- Real-time log processing
- Sensor data analysis
- Financial market analysis
- Natural Language Processing
- Online advertising
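
Storm itself runs on the JVM, but to keep one language throughout these notes, here is a sketch of a counting Bolt written against the streamparse Python bindings; the library choice, class name and tuple layout are illustrative assumptions:

    # A Bolt that counts words arriving on its input stream
    from streamparse import Bolt

    class WordCountBolt(Bolt):
        def initialize(self, conf, ctx):
            self.counts = {}

        def process(self, tup):
            word = tup.values[0]                  # first field of the incoming tuple
            self.counts[word] = self.counts.get(word, 0) + 1
            self.emit([word, self.counts[word]])  # pass results to downstream bolts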

Hadoop
- "High Availability Distributed Object Operation Platform"
- Distributed infrastructure for parallel processing
- Initially HDFS & Map/Reduce
- Ability to grow to thousands of machines
- Virtualizes the hardware; redundant
- Simplifies growth & parallel processing
- Distros: Cloudera, Hortonworks, Apache, IBM

HDFS - Hadoop Distributed File System
- Distributed file system for redundant storage
- Uses commodity hardware
- Supports big files (PBs) + large quantities of files
- Write-once-read-many
- Built for HW failure
- Better suited for batch processing

HDFS architecture - Master/Slave
- Data is organized into files and directories
- Files are divided into uniformly sized blocks

Master (Namenode)
- Manages the file system namespace in memory
- Maintains the file-name-to-block-list + location mapping
- Manages block allocation/replication
- Checkpoints the namespace and journals namespace changes for reliability
- Controls access to the namespace

HDFS Master/Slave - Slaves
- Datanodes handle block storage
- Blocks are stored using the underlying OS's files
- Clients access the blocks directly from datanodes
- Datanodes periodically send block reports to the Namenode and periodically check block integrity
- Blocks are replicated (default = 3)
- The file system keeps checksums of data for corruption detection and recovery
- HDFS exposes block placement so that computation can be migrated to the data

Map/Reduce
- Programming model for distributed computation jobs at a massive scale
- A framework to organize and execute such jobs
- The idea is to take the logic to the data, and not vice versa
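
Before going deeper into Map/Reduce, a minimal sketch of client access to HDFS as described above, assuming the Python hdfs package talking WebHDFS to the Namenode; host, port, user and paths are illustrative:

    # The client asks the Namenode for block locations, then reads/writes
    # the blocks directly against the Datanodes.
    from hdfs import InsecureClient  # WebHDFS client

    client = InsecureClient("http://namenode:50070", user="etl")
    client.write("/logs/2014-06-01/events.log", data=b"user=42 url=/home\n")
    with client.read("/logs/2014-06-01/events.log") as reader:
        print(reader.read())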

Map/Reduce
- Map: inspect a huge amount of data, get something of interest
- Shuffle and sort the interesting data
- Reduce: aggregate the interesting data, generate a final report
- Think of: grep | sort | aggregate by customer ID

[Diagram: Map/Reduce flow]
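
The grep-sort-aggregate analogy maps directly onto Hadoop Streaming, where the two phases are plain scripts reading stdin and writing stdout. A sketch that sums amounts per customer ID, assuming a simple comma-separated input layout of my own invention:

    # mapper.py - emit "customer_id<TAB>amount" for every interesting input line
    import sys
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")   # e.g. "cust-17,2014-06-01,19.99"
        if len(fields) == 3:
            print(fields[0] + "\t" + fields[2])

    # reducer.py - input arrives sorted by key, so totals accumulate per customer
    import sys
    current, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = key, 0.0
        total += float(value)
    if current is not None:
        print(current + "\t" + str(total))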

Hadoop - the big picture
[Diagram: Hadoop ecosystem]

Pig
- Map/Reduce is a lot of work to write
- Pig is a high-level data flow language
- Data processing language
- Compiler translates Pig Latin to Map/Reduce

Pig example
  Users = load 'users' as (name, age);
  Fltrd = filter Users by age >= 18 and age <= 25;
  Pages = load 'pages' as (user, url);
  Jnd   = join Fltrd by name, Pages by user;
  Grpd  = group Jnd by url;
  Smmd  = foreach Grpd generate group, COUNT(Jnd) as clicks;
  Srtd  = order Smmd by clicks desc;
  Top5  = limit Srtd 5;
  store Top5 into 'top5sites';

Hive - SQL with Hadoop
- Puts a schema/structure on log data stored in HDFS
- Provides an SQL-like query language
- SQL is compiled into a chain of M/R jobs
- Usually faster than hand-written Java M/R because of the optimizer in the compiler
- Has a command-shell interface
- Tables can be associated with a serializer/deserializer class to parse data into the table
- Tables can be partitioned and bucketed
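
As a sketch of the "SQL with Hadoop" idea, a similar top-sites question can be submitted to Hive programmatically, e.g. through the PyHive client; the host, table and column names are assumptions:

    from pyhive import hive

    conn = hive.connect(host="hive-server", port=10000)
    cur = conn.cursor()
    # Hive compiles this into a chain of Map/Reduce jobs behind the scenes
    cur.execute("""
        SELECT url, COUNT(*) AS clicks
        FROM page_views
        GROUP BY url
        ORDER BY clicks DESC
        LIMIT 5
    """)
    for url, clicks in cur.fetchall():
        print(url, clicks)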

Hive - high-level architecture
[Diagram: Hive architecture]

HBase
- Open source, non-relational, distributed, column-oriented database on top of HDFS
- (row key, column family, column, timestamp) -> value
- Real-time read/write access to data in HDFS
- No: SQL, joins, TX
- Yes: scalable, redundant, consistent
- Java, Thrift & REST APIs
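
A minimal sketch of the (row key, column family, column) -> value model through HBase's Thrift API, assuming the happybase Python client and an illustrative table layout:

    import happybase

    conn = happybase.Connection("hbase-thrift-host")  # talks to the Thrift gateway
    table = conn.table("calls")   # assumes a 'calls' table with family 'cf' exists

    # Columns live inside a column family ("cf" here); values are plain bytes
    table.put(b"call-00001", {b"cf:time": b"10:17", b"cf:duration": b"1:37"})
    print(table.row(b"call-00001"))   # real-time read of the row just written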

Cloudera Impala
- High-performance, general-purpose SQL query engine for Hadoop, not based on M/R
- Written in C++ with dynamic code generation
- impalad runs on every node: accepts client requests, plans & executes the queries
- statestored is used to find the data: provides the name service & metadata distribution
- catalogd relays metadata updates to all nodes

Cloudera Impala performance
- I/O-bound queries: x3-4 vs. Hive
- Queries needing multiple M/R phases in Hive: x45 faster in Impala
- Queries against memory-cached data: x90

Parquet
- Columnar storage across the Hadoop platform
- Reduces the IO requirement; saves space (better compression)
- Different encodings for different types of data
- Additional metadata: page/column/file statistics
- Implemented both in Java (M/R) & C++ (Impala)
- Can be used with Impala, Hive, Pig & M/R
- Impala & Hive can access the same Parquet tables

Columnar databases
- Store information by columns, rather than rows
- This is much better for data-crunching applications and DWH than row storage
- The table below will be stored on disk column by column: first all Call IDs (1, 2, 3), then all Times (10:17, 11:11, 13:15), then all Durations (1:37, 5:13, 3:...), then all Numbers (..., 23456, ...)

  Call ID | Time  | Duration | Number
  1       | 10:17 | 1:37     | ...
  2       | 11:11 | 5:13     | 23456
  3       | 13:15 | 3:...    | ...

- Brands: ParAccel (a.k.a. Matrix), Vertica, RedShift (cloud)
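
The column-by-column layout can be demonstrated with the pyarrow Parquet bindings (a later library than the deck's era; the data mirrors the illustrative call table above, with invented values where the original is garbled). Reading a single column back touches only that column's pages:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Each key becomes a column, stored contiguously and compressed on its own
    calls = pa.table({
        "call_id":  [1, 2, 3],
        "time":     ["10:17", "11:11", "13:15"],
        "duration": ["1:37", "5:13", "3:05"],
    })
    pq.write_table(calls, "calls.parquet", compression="snappy")

    # Column pruning: only the 'duration' column is read from disk
    print(pq.read_table("calls.parquet", columns=["duration"]))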

Columnar database features
- Homogeneous blocks on disk allow for better compression/encoding of data
- Storage reduction of x3 to x12
- Massive Parallel Processing (MPP) on commodity HW, enabled by correct distribution keys
- Take the code to the data; shared nothing
- Standard SQL language and relational schema
- UDFs (User Defined Functions), like stored procedures

Columnar database performance
- Columnar addresses the disk I/O bottleneck, specifically targeting seek time
- Designed for BI-oriented queries
- Selecting 7 columns from a table with 28 columns will fetch 1/4 of the data, as each block holds data from only one column
- The data will also be mostly sequential, so no seek time
- Compression reduces the amount of data fetched even more
- Local processing on each node (shared nothing)
- Denormalize, denormalize, denormalize

Columnar database pitfalls
- Trickle loading is tricky
- Concurrency might be an issue
- Internal network speed is crucial
- Data redistribution is an issue
- Updates are out of the question
- Joins can be an issue
- However, for all that you get x10 to x100 faster queries, less storage, bulk inserts of X TB an hour and excellent simultaneous load and query - it is worth it

In-memory databases
- A DBMS that only uses main memory (RAM) to store data
- In-memory replication is used for high availability
- Usually these DBs do an async backup to an external file or DB
- NVDIMM now makes it possible to run at full speed and maintain data in the event of a power failure
- These DBs are usually ACID and use ANSI SQL

In-memory databases
- Designed for extreme OLTP processing
- Some IMDBs (in-memory DBs) are MPP
- Close to linear scaling (some claim a=0.85)
- Some use smart mechanisms to only hold HOT data in memory
- Memory is 3,000 times faster than disk
- Cost is constantly dropping while size is increasing

In-memory examples
- HANA: column-oriented, relational, MPP RDBMS; convergence of OLTP and OLAP analytics
- VoltDB: MPP, relational RDBMS, ACID, ANSI SQL, Java UDFs
  - No write-ahead log, no redo log, no access to disk in a TX
  - Real-time response & async offload (file, columnar, Hadoop)
  - Hot vs. warm data (SSD) and K-safety for HA
  - Execution-queue concept with a worker thread per core

Analysis, visualization and self-service

ETL
- ETL is still mostly a traditional realm
- GoldenGate, Informatica & Python are very much in use, in addition to or instead of Hadoop
- More sources and targets, as each BigData technology is targeted at a different requirement
- Data travels more
- Multiple new BigData technologies sit side by side with traditional OLTP/operational systems

Traditional BI tools
- Traditional BI tools still play a big role: OBIEE, Cognos, BO, MicroStrategy
- Have connectors to BD technologies like Impala & Hive; most use JDBC or ODBC connectivity
- Enable federated reports from multiple sources
- Still require building a "world"
- Are still relatively cumbersome
- Provide most users with limited access to limited data

Self-service BI
- With more and more data collected & its value rising, accessibility is becoming a big issue
- Canned reports simply don't cut it anymore
- Business users want to add a local Excel file they have to the data and run correlations
- Exported partial data is less valuable
- Internal & external users want more access to data, and the more they get the more they desire

Self-service
- Data consumers want to run their own analysis, do their own exploration and use simple tools; they are not techies
- Security and constraints are required for self-service, especially in multi-tenant environments
- The trick in self-service is to give access to all the data required - and only that data
- The ability to easily manipulate data is critical

New visualization tools
- The new visualization tools are still traditional
- Tableau, QlikView and others offer a sexier, easier-to-use interface, intuitive for more users
- Point-and-click discovery and reporting
- Do not offer a paradigm shift
- New BI & visualization tools are just emerging: Datameer, Platfora - mostly around the large install base of Hadoop

New BI tools - Datameer as an example
- Paradigm shifts only happen with new BI tools that step out of the old realm
- Changing the way we think about BI
- No ETL - just EL
- No building of "worlds"
- Working directly and exclusively on Hadoop
- Easy integration and import of data
- An Excel-like interface for analytics

[Screenshot: Datameer UI]

Datameer capabilities
- Easy integration and import of information from files, logs, databases and many other sources
- Data manipulation done via an Excel-like interface; SQL or M/R knowledge is not required
- Administrators can control data access at the column level
- PowerPoint-like visualization
- A new way of looking at self-service BI

BigData in the cloud

BigData in the cloud
- More and more organizations place their data in the cloud, thus the data in the cloud is exploding
- Pay as you go, for what you use
- Elasticity: grow when required
- Think of changes as duplicate, change, test, switch, drop the old
- Decreased administration overhead
- Use state-of-the-art technologies

Google BigQuery
- Big Data analysis requires expensive hardware and skilled DB administrators
- Managing data centers and tuning software is time consuming and expensive
- Why not use web services as analytic tools instead?
- BigQuery: a fully managed data-analytics service in the cloud

What is BigQuery?
- Service for interactively analyzing big datasets
- Works in conjunction with Google Storage
- Uses an SQL-like query syntax
- Web service accessed via a RESTful API
- 99.9% reliable and secure: data replicated across multiple data centers, secured through Access Control Lists
- Scales to any number of users

BigQuery ecosystem
- Part of the Google cloud offering: App Engine (app execution), Compute Engine (Linux VMs), Storage, BigQuery
- Works with most BI & visualization tools
- Direct access from Google App Engine
- Securely share & distribute the results
- Offered with premium support

BigQuery - technology
- Columnar, MPP
- Uses a tree structure for distribution
- Limitations: only one join per query; relatively slow and can run out of resources (rare)
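
A sketch of submitting a query through the google-cloud-bigquery Python client (a newer client library than the deck describes; the project name is a placeholder, the dataset is a public sample):

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    job = client.query("""
        SELECT corpus, SUM(word_count) AS words
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY corpus
        ORDER BY words DESC
        LIMIT 5
    """)
    for row in job.result():          # blocks until the query finishes
        print(row.corpus, row.words)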

Amazon RedShift
- Data warehousing done the AWS way
- Easy to provision and scale up massively
- Pay as you go
- Really fast performance
- Open and flexible, with support for popular tools
- Petabyte scale

Amazon RedShift
- Dramatically reduces I/O via:
  - Column storage
  - Data compression
  - Zone maps
  - Direct-attached storage
  - Large data block sizes
- Optimized HW: 128GB RAM, 16 cores, 16TB disk

Amazon RedShift
- Parallelize and distribute everything: query, load, backup, restore, resize

Amazon RedShift - more features
- Monitor query performance
- Point & click resize or recreate
- Built-in security
- Automatic backups
- Integrates with multiple data sources
- Use existing analysis tools
- There are limitations, like a maximum of 15 concurrent users
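
Because RedShift speaks the PostgreSQL wire protocol, loading and querying can be sketched with psycopg2; the cluster endpoint, credentials, table, bucket and IAM role below are placeholders:

    import psycopg2

    conn = psycopg2.connect(
        host="mycluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="dw", user="admin", password="...")
    cur = conn.cursor()

    # Bulk load from S3 - parallelized across the cluster's slices
    cur.execute("""
        COPY calls FROM 's3://my-bucket/calls/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
        CSV GZIP
    """)
    cur.execute("SELECT number, SUM(duration_s) FROM calls GROUP BY number LIMIT 10")
    print(cur.fetchall())
    conn.commit()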

DynamoDB
- Fully managed NoSQL DB service on AWS
- Table based (each table is independent)
- Data stored in the form of name-value attributes
- Automatic scaling: provisioned throughput, storage scaling
- Distributed architecture
- Monitoring of tables with CloudWatch
- Integration with Elastic MapReduce: analyze and store in S3

DynamoDB
- Schema free
- Fast lookups using primary and range keys
- Support for complex queries (scan)
- Eventually consistent reads by default; strong consistency comes at a higher cost
- Must use an SDK/API to access
- Complex queries are executed via sequential/full table scans (high cost)
- Designed to be always writable

DynamoDB architecture
- Data is spread across hundreds of servers (nodes)
- Multiple versions of data across multiple nodes
- Conflict resolution happens during reads (not writes)
- Servers form a cluster in the shape of a ring
- Client connections go through either:
  - Routing via a load balancer, or
  - A client library that reflects Dynamo's partitioning scheme and can determine the storage host to connect to

DynamoDB limitations
- 64 KB limit on item size (row size)
- 1 MB limit on fetching data
- Pay more if you want strongly consistent reads
- Size is a multiple of 4KB (provisioned-throughput wastage)
- No table joins
- Indexes are created at table creation only
- No triggers
- Limited comparison capability
- Limited data types (text, number, binary)
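
A sketch with the boto3 SDK (newer than the deck's era; the table, key and attribute names are assumptions), including a strongly consistent read, which consumes double the read capacity:

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
    table = dynamodb.Table("events")   # hash key: device_id, range key: ts

    table.put_item(Item={"device_id": "sensor-17", "ts": 1402000000, "reading": 42})

    # Query by primary + range key (cheap); a Scan would walk the whole table
    resp = table.query(KeyConditionExpression=Key("device_id").eq("sensor-17"))
    print(resp["Items"])

    # Strongly consistent read - consumes twice the read capacity units
    item = table.get_item(Key={"device_id": "sensor-17", "ts": 1402000000},
                          ConsistentRead=True)
    print(item.get("Item"))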

Amazon Elastic MapReduce (EMR)
- A web service to process vast amounts of data: an on-demand Hadoop cluster
- Store data on Amazon S3
- Scale the number of virtual servers in your cluster to match your computation needs
- Start a Hadoop cluster to process data, turn it off when done, pay for the hours used
- EMR integrates seamlessly with AWS services

[Diagram: EMR]

EMR
Hadoop clusters running on Amazon EMR use:
- EC2 instances as virtual Linux servers for the master and slave nodes
- Amazon S3 for bulk storage of input and output data
- CloudWatch to monitor cluster performance and raise alarms
- Hive (used by EMR) to move data into and out of DynamoDB

EMR considerations
- EMR reduces Hadoop management complexity/costs
- Compared to a local cluster: lower performance of M/R jobs, reduced data throughput (S3 cannot be compared to a local HDD)
- EMR Hadoop is usually not the latest version
- You trade performance for convenience, cost and scalability
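
The start-process-terminate pattern looks roughly like this with the boto3 SDK (a later SDK than the deck's era; instance types, release label, roles, jar and S3 paths are placeholders):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    resp = emr.run_job_flow(
        Name="nightly-aggregation",
        ReleaseLabel="emr-4.7.0",
        Instances={
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 5,
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps finish
        },
        Steps=[{
            "Name": "aggregate",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "s3://my-bucket/jobs/aggregate.jar",
                "Args": ["s3://my-bucket/input/", "s3://my-bucket/output/"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(resp["JobFlowId"])   # pay only for the hours the cluster runs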

Kinesis
- AWS service for real-time processing of streaming data
- Easy administration
- Performs continuous processing on streaming big data
- Scales seamlessly to match operational needs
- Redshift and DynamoDB integration
- Client libraries allow designing and operating real-time streaming data applications
- Ensures high durability and availability of data by replicating across multiple Availability Zones
- Cost efficient for workloads of any scale

[Diagram: Kinesis]

Kinesis architecture - input
- Kinesis streams are sharded
- Each shard ingests up to 1MB/sec of data and up to 1,000 tps
- Data is stored for 24 hours
- Scaling is done by adding or removing shards
- To store data in a stream, producers use a PUT call
- PUTs are distributed across the shards using a partition key

Kinesis architecture - output
- You must design a distributed, fault-tolerant and scalable application that can keep up with the stream
- Use the Kinesis Client Library to:
  - Simplify reading from the stream
  - Automatically start Kinesis workers
  - Adjust the number of workers as the number of shards changes
  - Restart workers if they fail and redistribute them to use new EC2 instances
- EMR can be used to read and process data from streams
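
A producer-side and low-level consumer-side sketch with boto3 (stream name, shard id and payload are assumptions; a real consumer would normally use the Kinesis Client Library described above):

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    # Producer: the partition key decides which shard receives the record
    kinesis.put_record(StreamName="clicks",
                       Data=b'{"user": 42, "url": "/home"}',
                       PartitionKey="user-42")

    # Low-level consumer: walk one shard from its oldest retained record
    it = kinesis.get_shard_iterator(StreamName="clicks",
                                    ShardId="shardId-000000000000",
                                    ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
    out = kinesis.get_records(ShardIterator=it, Limit=100)
    for rec in out["Records"]:
        print(rec["SequenceNumber"], rec["Data"])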

Kinesis pricing
- Pay as you go; no up-front costs
- Hourly shard rate: $0.015
- Per 1,000,000 PUTs: $0.028
- Customers specify throughput requirements in shards
- Each shard delivers 1MB/s on ingest and 2MB/s on egress
- Inbound data transfer is free
- EC2 charges apply for Kinesis processing apps

Now let's draw our own textbook architecture

Will hardware catch up?
[Chart: data growth]

Technology to balance growth
- A lot of BigData technology revolves around tackling disk speed and size:
  - Avoiding seek time
  - Working in memory
  - Smart caching
  - Ordering data homogeneously for better compression
  - Duplicating data for bandwidth purposes
  - Eventual consistency

The storage world is not idle
- Storage today is faster, with more capacity, and smarter; 1 million IOPS is easy
- Violin: 70TB of usable space in a 3U SSD box
- XtremIO: extreme deduplication in SSD arrays
- Infinidat: 2PB in a 42U rack
- Exadata: database logic executed at the hardware level
- Fusion-io: 5.2TB with 2.7GB/s bandwidth on board

Storage revives legacy
- Legacy system longevity can be greatly prolonged by advanced storage
- The easiest upgrade path: just copy the data
- Limitations like data size, TX per second and other boundaries can be taken down by storage
- And that is before DNA storage: 5.5PB in one cubic millimeter

What does the future hold? Who knows?

InformationWeek innovators
1. MongoDB
2. Amazon (Redshift, EMR, DynamoDB)
3. Cloudera (CDH, Impala)
4. Couchbase
5. Datameer
6. Datastax
7. Hadapt
8. Hortonworks
9. Karmasphere
10. MapR
11. Neo Technology
12. Platfora
13. Splunk

A few facts
- Data will continue to explode
- Value will become harder to harvest
- Storage will get cheaper
- CPUs will be faster
- Cost will be a big player
- The realm remains prone to disruption
- Innovation is shifting from technology to implementation

Sam MidLink
sam@midlink.ca
