Day with Development Master Class: Big Data Management System
DW & Big Data Global Leaders Program
Jean-Pierre Dijcks, Big Data Product Management, Server Technologies
Part 1: Foundation and Architecture of a BDMS
Part 2: Streaming & Batch Data Ingest and Tooling
Storing Data in HDFS and Its Relation to Performance: Space Usage vs. Type Complexity
- Data types like JSON are popular, especially when exchanging data or when capturing messages
- Simple JSON documents can be left in their full state
- If the documents are deeply nested, it pays to flatten them upon ingest; the consequence is of course an expansion of the data size
- But joins and typical analytics on Hadoop will perform better on simpler objects
- In regulated industries, it pays to keep the original JSON as well as the decomposed structure; be sure to compress the original and save it in a source directory (see the sketch below)
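As an illustration of this flatten-and-keep-the-original pattern, here is a minimal sketch using jq to project a nested document into a flat record while the raw JSON is compressed and archived into a source directory. The file names, field names, and HDFS paths are made up for the example.

```bash
# Hypothetical input: orders.json, one nested JSON document per line.
# Flatten the nested "customer" object into top-level fields with jq,
# then load the flattened file into HDFS for analytics.
jq -c '{order_id: .id, amount: .amount, cust_name: .customer.name, cust_city: .customer.address.city}' \
  orders.json > orders_flat.json
hdfs dfs -put orders_flat.json /data/orders/flat/

# Keep the original for regulated environments: compress it and save it
# away under a source directory in HDFS.
gzip -c orders.json | hdfs dfs -put - /data/orders/source/orders.json.gz
```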
General Classification
- Stream: Flume, GoldenGate, Kafka
- Batch (push): HDFS Put, (S)FTP, NFS
- Batch (pull): wget/curl & HDFS Put, wget/curl & (S)FTP/NFS
Batch Loading Data into HDFS: Pushing Data
Don't do this
- HDFS in this case should replace any additional SAN or NAS filers
Instead, try to do this
- Add either an FTP or a Hadoop client to the source
- Major benefits from this simple change:
  - Reduces the amount of NAS/SAN storage => cost savings
  - Reduces complexity
  - Reduces data proliferation (improved security)
Using a Hadoop Client to Load Data
Diagram: a Hadoop client on the source server issues an HDFS Put from the local Linux FS straight to the Big Data Appliance HDFS nodes.
- HDFS Put is issued from the Hadoop client on the source server
- Enables direct HDFS writes without intermediate file staging on the Linux FS
- Easy to scale: initiate concurrent puts for multiple files; HDFS will leverage multiple target servers and ingest faster (see the sketch below)
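A minimal sketch of this approach, assuming the Hadoop client is installed and configured on the source server and that the target directory already exists in HDFS (paths are illustrative):

```bash
# Single file: write straight from the source server's Linux FS into HDFS.
hdfs dfs -put /data/export/sales_2015-06-01.csv /landing/sales/

# Scale out: issue concurrent puts for multiple files; HDFS spreads the
# blocks across multiple DataNodes and ingests faster.
for f in /data/export/*.csv; do
  hdfs dfs -put "$f" /landing/sales/ &
done
wait
```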
FTP-ing onto the Local Linux File System: Basic Flow
Install an FTP server on the BDA node(s); some FTP clients can write to WebHDFS directly.
1. FTP files onto the local Linux FS on the BDA (something like /u12)
2. Use HDFS Put to load the data from the Linux FS into HDFS
3. Remove the files from the Linux FS
4. Repeat
A minimal script for steps 2-4 is sketched below.
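The steps after landing could be scripted on the BDA node along these lines; the directory names (/u12/landing, /landing/incoming) are illustrative assumptions.

```bash
#!/bin/bash
# Runs on the BDA node that receives the FTP uploads.
LOCAL_DIR=/u12/landing          # Linux FS directory the FTP server writes to
HDFS_DIR=/landing/incoming      # HDFS target directory

for f in "$LOCAL_DIR"/*; do
  [ -f "$f" ] || continue
  # Step 2: load the file from the Linux FS into HDFS
  if hdfs dfs -put "$f" "$HDFS_DIR"/; then
    # Step 3: remove the file from the Linux FS only after a successful put
    rm -f "$f"
  fi
done
# Step 4: repeat, e.g. by scheduling this script from cron
```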
FTP: Managing Space for Linux and HDFS on Ingest Nodes
- You cannot (today) de-allocate a few disks from HDFS on the BDA
- You should therefore:
  - Set a quota on how large HDFS can grow on the ingest nodes
  - Set a quota at the Linux level to regulate space
- Sizing depends on:
  - The ingest and cleanup schedule
  - The ingest size
  - Peak ingest sizes
(A quota sketch follows below.)
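One way to keep ingest space under control, sketched with illustrative sizes and paths: cap the HDFS landing area with an HDFS space quota, and keep room for the Linux FS by reserving non-HDFS space per DataNode volume.

```bash
# Cap how much HDFS space the landing area may consume (size is illustrative).
# Note: the space quota counts replicated bytes (file size x replication factor).
hdfs dfsadmin -setSpaceQuota 10t /landing

# Check the quota and current usage.
hdfs dfs -count -q /landing

# For the Linux FS side, reserve space per DataNode data volume so HDFS
# cannot fill the disks completely: set dfs.datanode.du.reserved (in bytes)
# in hdfs-site.xml and restart the DataNodes.
```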
FTP: High Availability
- Run multiple FTP servers on multiple BDA nodes
- Provide a load balancer like HAProxy (included with Oracle Linux)
Diagram: the source server connects through HAProxy, which balances across the FTP servers on the BDA HDFS nodes.
Batch Loading Data into HDFS: Pulling Data
Pulling Data with wget or curl and a Hadoop Client
Diagram: a Hadoop client pulls from the source server with wget or curl and issues the HDFS Put toward the Big Data Appliance HDFS nodes.
- Use wget or curl to initiate the data transfer and load
- Pipe the download straight through to an HDFS Put
- FTP and HTTP sources both work
- All observations from the previous section apply (see the sketch below)
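A hedged sketch of the pull-and-pipe pattern; the URLs and HDFS paths are illustrative.

```bash
# Pull over HTTP and stream straight into HDFS: '-' tells hdfs dfs -put
# to read from stdin, so nothing is staged on the local Linux FS.
curl -sS http://source.example.com/exports/clicks_2015-06-01.csv.gz \
  | hdfs dfs -put - /landing/clicks/clicks_2015-06-01.csv.gz

# The same pattern works for FTP sources, here with wget writing to stdout.
wget -qO- ftp://source.example.com/exports/orders.csv \
  | hdfs dfs -put - /landing/orders/orders.csv
```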
Grabbing Data from Databases
Diagram: paths from an Oracle database into the Big Data Appliance HDFS nodes.
- Big Data SQL (SQL based): Copy to BDA, Table Access to BDA
- Sqoop (object based)
- GoldenGate (change capture)
(A Sqoop import sketch follows below.)
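For the Sqoop path, an import from an Oracle table might look roughly like this; the connection details, table name, and target directory are illustrative assumptions.

```bash
# Pull a table from Oracle into HDFS with 4 parallel map tasks.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL \
  --username SCOTT --password-file /user/etl/.ora_pwd \
  --table SALES \
  --target-dir /landing/sales_import \
  --num-mappers 4
```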
1) Avoid any additional external staging systems, as these systems reduce scalability
2) Opt for tools and methods that write directly into HDFS, like HDFS Put
Moving Mainframe Data into HDFS: Batch Files
Using GoldenGate to Replicate from the Mainframe
- GoldenGate can replicate from the mainframe database
- Apply directly into HDFS or HBase
Diagram: mainframe -> GoldenGate -> Big Data Appliance HDFS nodes.
Mainframe Data
- General assumption: any data collection on the mainframe needs to be non-intrusive due to security and cost (MIPS) reasons
- Existing jobs typically generate files
- SyncSort is one of the leading mainframe tools
- FTP (via ETL tools) from the mainframe to recipient systems
Using File Transfers to Move Data from the Mainframe
- Follow the push and pull mechanisms discussed earlier
Diagram: the mainframe (with SyncSort) pushes or pulls data to the Big Data Appliance HDFS nodes.
Using File Transfers to Move Data from the Mainframe
Most mainframe files will be in EBCDIC format and need to be converted to ASCII:
1. Land the files on local disk (Linux FS)
2. Put the files into HDFS
3. Convert from EBCDIC to ASCII using standard tooling (e.g. SyncSort) on Hadoop
4. Optional: copy the ASCII file and compress it together with the original EBCDIC files
5. Archive the original file (together with the ASCII file, if step 4 was done)
6. Delete the original files from the Linux FS
7. Repeat
A simplified sketch of this flow follows below.
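The deck recommends proven tooling such as SyncSort on Hadoop for the conversion; purely to give the flow some shape, the sketch below shows a simple byte-wise conversion with dd on the Linux FS for a fixed-length file without packed-decimal (COMP-3) fields. File names, record length, and directories are illustrative assumptions.

```bash
# 1./2. Land the mainframe file and put it into HDFS unchanged.
hdfs dfs -put /u12/landing/CUSTFILE.ebc /landing/mainframe/

# 3. Simple byte-wise EBCDIC-to-ASCII conversion with dd; cbs= is the fixed
#    record length, and with conv=ascii each record is unblocked into a
#    newline-terminated line. This does NOT handle COMP-3/packed fields -
#    use SyncSort or similar copybook-driven tooling on Hadoop for real data.
dd if=/u12/landing/CUSTFILE.ebc of=/u12/landing/CUSTFILE.ascii \
   cbs=120 conv=ascii

# 4./5. Compress and archive the original together with the ASCII copy.
tar czf /u12/archive/CUSTFILE.tar.gz \
    /u12/landing/CUSTFILE.ebc /u12/landing/CUSTFILE.ascii
hdfs dfs -put /u12/archive/CUSTFILE.tar.gz /archive/mainframe/

# 6. Delete the originals from the Linux FS.
rm -f /u12/landing/CUSTFILE.ebc /u12/landing/CUSTFILE.ascii
```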
1) Keep the transfer software as simple as possible
2) Move as much file processing as possible from the mainframe to the BDA, and use proven tools for EBCDIC-to-ASCII conversions
Streaming Data: Product Approach
Various Tooling Options
- Apache Kafka seems to be a (new) favorite
- Oracle GoldenGate just added a big data option, enabling streaming from GG sources into HDFS and Hive, for example
- Oracle Event Processing enables a rich developer environment and low-latency stream processing
- See the respective documentation for details, usage, and restrictions
- Note the distinction between transport and processing: OEP is an example of stream processing, whereas Kafka is stream transport
What should I use now that I am streaming data?
- Chances are you have no choice:
  - Your sources are publishing data onto a messaging bus
  - Your organization already has a streaming system in place
- Nevertheless, the following section will attempt to clarify this question
Apache Flume (NG)
Currently one of the most common tools, with many pre-built sources and sinks; some other interesting aspects:
- Scalable, with fan-in and fan-out capabilities
- Direct write into HDFS
- Can evaluate simple in-stream actions
- Part of CDH and supported as such
Use for streaming when:
- Simple actions need to be evaluated
- Reasonable latency is OK
- Scalability is key
- You are using it for other data sources
Oracle Event Processing
Low latency, with an easy-to-use visual modeling environment and its own DSL called Continuous Query Language (CQL):
- Available for the data center as well as embedded, enabling large fan-in setups for IoT-like systems
- Direct write into HDFS as well as Oracle NoSQL DB
- Can evaluate complex in-stream actions, leveraging readings from NoSQL and Oracle Database, and can leverage Oracle Spatial, for example
- Focuses on very low latency and complex actions
Use for streaming when:
- You need low latency, embedded deployment, and complex actions, expanding to IoT
- You are looking for mature tooling and an easy-to-use DSL
Apache Kafka
Highly scalable messaging system (originally from LinkedIn):
- Pub-sub mechanism
- Distributed and highly resilient
- Highly scalable, even when serving a mix of batch and online consumers
- No action evaluation capabilities (needs external tooling for this)
Use for streaming when:
- You are looking for a scalable messaging system
- You are dealing with very high volumes
- You can code a number of things when needed
(A minimal example with Kafka's console tools follows below.)
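To make the pub-sub model concrete, a minimal sketch with Kafka's bundled console tools; broker and ZooKeeper addresses, the topic name, and partition counts are illustrative, and exact flags vary by Kafka version.

```bash
# Create a topic with several partitions for parallelism and replication
# for resilience.
kafka-topics.sh --create --zookeeper zk1.example.com:2181 \
  --topic web-clicks --partitions 8 --replication-factor 3

# A producer publishes events (one message per line from stdin).
kafka-console-producer.sh --broker-list kafka1.example.com:9092 \
  --topic web-clicks

# Independent consumers (batch or online) read the same stream.
kafka-console-consumer.sh --zookeeper zk1.example.com:2181 \
  --topic web-clicks --from-beginning
```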
Conclusion
- Use Flume for specific use cases such as rolling log files. Why? Flume has a lot of specific code available to deal with a large number of log formats, and it writes directly into HDFS
- Use OEP when you need event processing (processing): complex rules are applied across the spectrum, or you need embedded systems (standardize)
- Use Kafka (transportation) when you have the skills or can acquire them, and you are looking for massive-scale queuing/streaming
Streaming Data into HDFS: Pushing Data
Flume: Streaming Logs to HDFS
Diagram: a webserver with a Flume Log4j client sends events to a Flume agent, whose Flume HDFS sink writes to the Big Data Appliance HDFS nodes.
Note: Flume enables simple event processing as well as direct movement into HDFS or other sinks.
Flume: Streaming Logs to HDFS - Flume Concepts
- Client: captures and transmits events to the next hop
- Agent: agents can write to other agents through sinks
- Source: receives events and delivers them to one or more channels
- Channel: receives the event, which gets drained by sinks
- Sink: either finishes a flow (e.g. the HDFS sink) or transmits to the next agent
(A minimal agent configuration is sketched below.)
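A minimal sketch of these concepts as a Flume agent configuration, written and started from the shell. The agent name, paths, and the exec/tail source are illustrative assumptions; the webserver case on the previous slide would use a Log4j/Avro source instead.

```bash
# Write a one-source, one-channel, one-sink agent configuration.
cat > /etc/flume-ng/conf/weblog-agent.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: tail a log file and hand events to the channel
a1.sources.r1.type     = exec
a1.sources.r1.command  = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

# Channel: buffers events until a sink drains them
a1.channels.c1.type     = memory
a1.channels.c1.capacity = 10000

# Sink: finishes the flow by writing directly into HDFS
a1.sinks.k1.type          = hdfs
a1.sinks.k1.channel       = c1
a1.sinks.k1.hdfs.path     = hdfs:///landing/weblogs
a1.sinks.k1.hdfs.fileType = DataStream
EOF

# Start the agent.
flume-ng agent --conf /etc/flume-ng/conf \
  --conf-file /etc/flume-ng/conf/weblog-agent.conf --name a1
```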
Flume: Streaming Logs to HDFS - Splitting Streams / Multiple Consumers
Diagram: the webserver's Flume Log4j client feeds one Flume source; the source fans out into two channels (Production and DR site), each drained by its own Flume HDFS sink, so the same data flows to both HDFS clusters.
Standardize as much as possible on a single technology for ease of management (see the next topic).
Landing Streaming Data
Land in HDFS or NoSQL?
Driven by query requirements:
- Do I need to see individual transactions as they land?
- Do I need key-based access in real time?
- Can I wait for HDFS to write to disk?
Diagram: a stream landing in NoSQL nodes and/or HDFS nodes.
The need for a separate NoSQL store does complicate architectures, so only do this if required.
Streaming: Some Example Architectures
OEP - NoSQL Database - Hadoop
- Embedded OEP on sensors
- OEP on gateway (GTW) devices
- NoSQL DB to catch data and deliver models to OEP
Diagram: sensors and gateway devices feed the NoSQL database and the Big Data Appliance HDFS nodes.
OEP - NoSQL - Hadoop
- OEP instances are not linked and act upon a partition of the inputs
- Embedded OEP on sensors, OEP on gateway (GTW) devices
- Add a Coherence distributed memory grid to enable data sharing between all OEP instances
Flume - Kafka - Hadoop
Diagram: Flume clients and agents publish into a Kafka cluster; a Flume HDFS sink drains Kafka into the Big Data Appliance HDFS nodes.
Future State? Kafka - Hadoop
Diagram: Kafka producers publish into a Kafka cluster; Kafka consumers write into the Big Data Appliance HDFS nodes.
The tooling for streaming is in flux; Kafka looks like it is going to stick around. When in doubt, look at vendor options, as they are often better documented and supported.
HDFS Data into Databases
From HDFS to the Database
- Oracle Big Data SQL: enables transparent SQL access for the end user across BDA + Exadata (covered in the next section!)
- Big Data Connectors:
  - Oracle SQL Connector for HDFS
  - Oracle Loader for Hadoop
- Sqoop
Diagram: each of these paths moves data from the Big Data Appliance HDFS nodes into the Oracle database.
A Few Comments
- Sqoop is widely used, but also widely complained about: handle with care and know what you are doing (see the sketch below)
- Big Data Connectors: better performance than Sqoop; the preferred option for Oracle Database loads
- Oracle Data Integrator: when licensing Big Data Connectors on the Big Data Appliance, ODI is included as a restricted-use license; this applies when all transformations are done on the BDA (none on the Oracle DB, for example)
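For the plain Sqoop route, a hedged sketch of exporting an HDFS directory into an existing Oracle table; the connection details, table name, and directory are illustrative assumptions, and the Oracle Loader for Hadoop / SQL Connector paths use their own tooling not shown here.

```bash
# Push a delimited HDFS dataset into an existing Oracle table.
sqoop export \
  --connect jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL \
  --username SCOTT --password-file /user/etl/.ora_pwd \
  --table SALES_SUMMARY \
  --export-dir /warehouse/sales_summary \
  --input-fields-terminated-by ',' \
  --num-mappers 4
```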
Use (ETL) tools where you can, as they simplify implementation and enable you to shift implementation paradigms more quickly.