ORACLE
Oracle Press

Oracle Big Data Handbook

Tom Plunkett, Brian Macdonald, Bruce Nelson, Helen Sun, Khader Mohiuddin, Debra L. Harding, David Segleau, Gokula Mishra, Mark F. Hornick, Robert Stackowiak, Keith Laker

McGraw-Hill Education
New York, Chicago, San Francisco, Athens, London, Madrid, Mexico City, Milan, New Delhi, Singapore, Sydney, Toronto
Contents

Acknowledgments
Introduction

PART I Introduction

1 Introduction to Big Data
    Big Data
    Google's MapReduce Algorithm and Apache Hadoop
    Oracle's Big Data Platform
    Summary
2 The Value of Big Data
    Am I Big Data, or Is Big Data Me?
    Big Data, Little Data It's Still Me
    What Happened?
    Now What?
    Reality, Check Please!
    What Do You Make of It?
    Information Chain Reaction (ICR)
    Big Data, Big Numbers, Big Business?
    Twitter
    Facebook
    Internal Source
    ICR: Connect
    ICR: Change
    Wanted: Big Data Value
    Big Data Example 1: Clinical Trial Research Within the Healthcare Industry
    Example 2: Improvements in Car Design for Driver Safety Within the Automotive Industry
    Summary

PART II Big Data Platform

3 The Apache Hadoop Platform
    Software vs. Hardware
    The Hadoop Software Platform
    Hadoop Distributions and Versions
    The Hadoop Distributed File System (HDFS)
    Scheduling, Compute, and Processing
    Operating System Choices
    I/O and the Linux Kernel
    The Hadoop Hardware Platform
    CPU and Memory
    Network
    Disk
    Putting It All Together
4 Why an Appliance?
    Why Would Oracle Create a Big Data Appliance?
    What Is an Appliance?
    What Are the Goals of Oracle Big Data Appliance?
    Optimizing an Appliance
    Oracle Big Data Appliance Version 2 Software
    Oracle Big Data Appliance X3-2 Hardware
    Where Did Oracle Get Hadoop Expertise?
    Configuring a Hadoop Cluster
    Choosing the Core Cluster Components
    Assembling the Cluster
    What About a Do-It-Yourself Cluster?
    Total Costs of a Cluster
    Time to Value
    How to Build Out Larger Clusters
    Can I Add Other Software to Oracle Big Data Appliance?
    Drawbacks of an Appliance
5 BDA Configurations, Deployment Architectures, and Monitoring
    Introduction
    Big Data Appliance X3-2 Full Rack (Eighteen Nodes)
    Big Data Appliance X3-2 Starter Rack (Six Nodes)
    Big Data Appliance X3-2 In-Rack Expansion (Six Nodes)
    Hardware Modifications to BDA
    Software Supported on Big Data Appliance X3-2
    BDA Install and Configuration Process
    Critical and Noncritical Nodes
    Automatic Failover of the NameNode
    BDA Disk Storage Layout
    Adding Storage to a Hadoop Cluster
    Hadoop-Only Config and Hadoop+NoSQL DB
    Hadoop-Only Appliance
    Hadoop and NoSQL DB
    Memory Options
    Deployment Architectures
    Multitenancy and Hadoop in the Cloud
    Scalability
    Multirack BDA Considerations
    Installing Other Software on the BDA
    BDA in the Data Center
    Administrative Network
    Client Access Network
    InfiniBand Private Network
    Network Requirements
    Connecting to Data Center LAN
    Example Connectivity Architecture
    Oracle Big Data Appliance Restrictions on Use
    BDA Management and Monitoring
    Enterprise Manager
    Cloudera Manager
    Hadoop Monitoring Utilities: Web GUI
    Oracle ILOM
    Hue
    DCLI Utility
6 Integrating the Data Warehouse and Analytics Infrastructure to Big Data
    The Data Warehouse as a Historic Database of Record
    The Oracle Database as a Data Warehouse
    Why the Data Warehouse and Hadoop Are Deployed Together
    Completing the Footprint: Business Analyst Tools
    Building Out the Infrastructure
7 BDA Connectors
    Oracle Big Data Connectors
    Oracle Loader for Hadoop
    Online Mode
    Oracle OCI Direct Path Output
    JDBC Output
    Offline Mode
    Oracle Data Pump Output
    Delimited Text Output
    Installation of Oracle Loader for Hadoop
    Invoking Oracle Loader for Hadoop
    Input Formats
    DelimitedTextInputFormat
    RegexInputFormat
    AvroInputFormat
    HiveToAvroInputFormat
    KVAvroInputFormat
    Custom Input Formats
    Oracle Loader for Hadoop Configuration Files
    Loader Maps
    Additional Optimizations
    Leveraging InfiniBand
    Comparison to Apache Sqoop
    Oracle SQL Connector for HDFS
    Installation of Oracle SQL Connector for HDFS
    HIVE Installation
    Creating External Tables Using Oracle SQL Connector for HDFS
    ExternalTable Configuration Tool
    Data Source Types
    Configuration Tool Syntax
    Required Properties
    Optional Properties
    ExternalTable Tool for Delimited Text Files
    Testing DDL with -noexecute
    Adding a New HDFS File to the Location File
    Manual External Table Configuration
    Hive Sources
    ExternalTable Example
    Oracle Data Pump Sources
    Configuration Files
    Querying with Oracle SQL Connector for HDFS
    Oracle R Connector for Hadoop
    Oracle Data Integrator Application Adapter for Hadoop
8 Oracle NoSQL Database
    What Is a NoSQL Database System?
    NoSQL Applications
    Oracle NoSQL Database
    A Sample Use Case
    Architecture
    Client Driver
    Key-Value Pairs
    Storage Nodes
    Replication
    Smart Topology
    Online Elasticity
    No Single Point of Failure
    Data Management
    APIs
    CRUD Operations
    Multiple Update Operations
    Lookup Operations
    Transactions
    Predictable Performance
    Integration
    Installation and Administration
    Simple Installation
    Administration
    How Oracle NoSQL Database Stacks Up
    Useful Links

PART III Analyzing Information and Making Decisions

9 In-Database Analytics: Delivering Faster Time to Value
    Introduction
    Oracle's In-Database Analytics
    Why Running In-Database Is So Important
    Introduction to Oracle Data Mining and Statistical Analysis
    Oracle's In-Database Advanced Analytics
    Oracle Data Mining
    Introduction to R
    Text Mining
    In-Database Statistical Functions
    Making BI Tools Smarter
    Spatial Analytics
    Understanding the Spatial Data Model
    Querying the Spatial Data Model
    Using Spatial Analytics
    Making BI Tools Smarter
    Graph-Based Analytics
    Graph Data Model
    Querying Graph Data
    Multidimensional Analytics
    Making BI Tools Smarter and Faster
    In-Database Analytics: Bringing It All Together
    Integrating Analytics into Extract-Load-Transform Processing
    Delivering Guided Exploration
    Delivering Analytical Mash-ups
    Conclusion
10 Analyzing Data with R
    Introduction to Open Source R
    CRAN, Packages, and Task Views
    GUIs and IDEs
    Traditional R and Database Interaction vs. Oracle R Enterprise
    Oracle's Strategic R Offerings
    Oracle R Enterprise
    Oracle R Distribution
    ROracle
    Oracle R Connector for Hadoop
    Oracle R Enterprise: Next-Level View
    Oracle R Enterprise Installation and Configuration
    Using Oracle R Enterprise
    Transparency Layer
    Embedded R Execution
    Predictive Analytics
    Oracle R Connector for Hadoop
    Invoking MapReduce Jobs
    Testing ORCH R Scripts Without the Hadoop Cluster
    Interacting with HDFS from R
    HDFS Metadata Discovery
    Working with Hadoop Using the ORCH Framework
    ORCH Predictive Analytics on Hadoop
    ORCHhive
    Oracle R Connector for Hadoop and Oracle R Enterprise Interaction
    Summary
11 Endeca Information Discovery
    Why Did Oracle Select Endeca?
    Product Suites Overview
    Endeca Information Discovery Platform
    Major Functional Areas
    Key Features
    Endeca Information Discovery and Business Intelligence
    Difference in Roles and Functions
    BI Development Process vs. Information Discovery Approach
    Complementary But Not Exclusive
    Architecture
    Oracle Endeca Server
    Oracle Endeca Studio
    Oracle Endeca Integration Suite
    Endeca on Exalytics
    Scalability and Load Balancing
    Unifying Diverse Content Sets
    Endeca Differentiator
    Industry Use Cases
    Hands-On with Endeca
    Installation and Configuration
    Developing an Endeca Application
12 Big Data Governance
    Key Elements of Enterprise Data Governance
    Business Outcome
    Information Lifecycle Management
    Regulatory Compliance and Risk Management
    Metadata Management
    Data Quality Management
    Master and Reference Data Management
    Data Security and Privacy Management
    Business Process Alignment
    How Does Big Data Impact Enterprise Data Governance?
    Modeled Data vs. Raw Data
    Types of Big Data
    Applying Data Governance to Big Data
    Leveraging Big Data Governance
    Industry-Specific Use Cases
    Utilities
    Healthcare
    Financial Services
    Retail
    Consumer Packaged Goods (CPG)
    Telecommunications
    Oil and Gas
    How Does Big Data Impact Data Governance Roles?
    Governance Roles and Organization
    An Approach to Implementing Big Data Governance
13 Developing Architecture and Roadmap for Big Data
    Architecture Capabilities for Big Data
    New Characteristics of Big Data
    Conceptual Architecture Capabilities of Big Data
    Product Capabilities and Tools
    Making Big Data Architecture Decisions
    Architecture Development Process for Realizing Incremental Values
    Overview of Oracle Information Architecture Framework
    Overview of Applied OADP for Information Architecture
    Big Data Architecture Development Process
    Impact on Data Management and BI Processes
    Traditional BI Development Process
    Big Data and Analytics Development Process
    Big Data Governance
    Traditional Data Governance Focus
    New Focus for Governance in Big Data
    Developing Skills and Talent
    Data Scientist
    Big Data Developer
    Big Data Administrator
    Big Data Best Practices
    Align Big Data Initiative with Specific Business Goals
    Ensure a Centralized IT Strategy for Standards and Governance
    Use a Center of Excellence to Minimize Training and Risk
    Correlate Big Data with Structured Data
    Provide High-Performance and Scalable Analytical Sandboxes
    Reshape the IT Operating Model

Index